Abstract


We study hierarchical clustering schemes under an axiomatic view. We show that within this framework,
one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an
existence and uniqueness theorem instead of a non-existence result. We explore further properties 
of this unique scheme: stability and convergence are established. We represent dendrograms as 
ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to 
quantify the degree to which perturbations in the input metric space affect the result of hierarchical 
methods.


1. Introduction


Clustering techniques play a very central role in various parts of data analysis. They can give 
important clues to the structure of data sets, and therefore suggest results and hypotheses in the 
underlying science. Many of the interesting methods of clustering available have been applied to 
good effect in dealing with various data sets of interest. However, despite being one of the most 
commonly used tools for unsupervised exploratory data analysis, and despite its extensive literature, 
very little is known about the theoretical foundations of clustering methods. These points have been 
recently made prominent by von Luxburg and Ben-David (2005); Ben-David et al. (2006). 

The general question of which methods are “best”, or most appropriate for a particular problem, 
or how significant a particular clustering is has not been addressed too frequently. This lack of 
theoretical guarantees can be attributed to the fact that many methods involve particular choices to 
be made at the outset, for example how many clusters there should be, or the value of a particular 
thresholding parameter. In addition, some methods depend on artifacts in the data, such as the 
particular order in which the observations are listed. 

In Kleinberg (2002), Kleinberg proves a very interesting impossibility result for the problem of 
even defining a clustering scheme with some rather mild invariance properties. He also points out 
that his results shed light on the trade-offs one has to make in choosing clustering algorithms. 

Standard clustering methods take as input a finite metric space (X,d) and output a partition 
of X. Let P(X) denote the set of all possible partitions of the set X. Kleinberg (2002) discussed
this situation in an axiomatic way and identified a set of reasonable properties of standard clustering
schemes, namely, scale invariance, richness and consistency. Fix a standard clustering method f
and a metric space (X,d) and let $f(X,d) = \Pi \in P(X)$. Kleinberg identified the following desirable
properties of a clustering scheme:


• Scale Invariance: For all $\alpha > 0$, $f(X, \alpha \cdot d) = \Pi$.


• Richness: Fix any finite set X. Then for all $\Pi \in P(X)$, there exists $d_\Pi$, a metric on X s.t.
$f(X, d_\Pi) = \Pi$.


• Consistency: Let $\Pi = \{B_1, \ldots, B_\ell\}$. Let $\tilde{d}$ be any metric on X s.t.


1. for all $x, x' \in B_\alpha$, $\tilde{d}(x,x') \le d(x,x')$ and


2. for all $x \in B_\alpha$, $x' \in B_{\alpha'}$, $\alpha \ne \alpha'$, $\tilde{d}(x,x') \ge d(x,x')$.


Then, $f(X, \tilde{d}) = \Pi$.


He then proved, in the same spirit as Arrow’s impossibility theorem, that no clustering scheme
satisfying these conditions simultaneously can exist.


Theorem 1 (Kleinberg, 2002) There exists no clustering algorithm that satisfies scale invariance, 
richness and consistency. 


Then, in particular, Kleinberg’s axioms rule out single, average and complete linkage (standard)
clustering. Clusters in any of these three methods can be obtained by first constructing a hierarchical
decomposition of space (such as those provided by hierarchical clustering methods) and then
selecting the partition that arises at a given, fixed, threshold.

A natural question is whether Kleinberg’s impossibility result still holds when one admits clustering
schemes that do not try to return a fixed partition of a space, but are allowed to return a
hierarchical decomposition.

Furthermore, data sets can exhibit multiscale structure and this can render standard clustering
algorithms inapplicable in certain situations, see Figure 1. This further motivates the use of hierarchical
clustering methods. Hierarchical methods take as input a finite metric space (X,d) and
output a hierarchical family of partitions of X.


Figure 1: Data set with multiscale structure. Any standard clustering algorithm will fail to capture 
the structure of the data. 


These hierarchical families of partitions that constitute the output of hierarchical methods receive
the name of dendrograms. Dendrograms come in two versions: proximity and threshold
dendrograms. These two types of dendrograms differ in whether they retain some proximity information
about the underlying clusters that they represent or not: proximity dendrograms do retain
such information whereas threshold dendrograms do not. Practitioners of statistical data analysis
seem to work almost exclusively with proximity dendrograms. For this reason we opt to carry out
our analysis under the model that hierarchical methods take as input a finite metric space X and
output a proximity dendrogram over X, see Remark 3.

We remind the reader that we are using the term standard clustering methods to refer to proce- 
dures that take a finite metric space as input and output a fixed single partition of the metric space. 

In a similar spirit to Kleinberg’s theorem, we prove in Theorem 18 that in the context of hierar- 
chical methods, one obtains uniqueness instead of non-existence. We emphasize that our result can 
be interpreted as a relaxation of the theorem proved by Kleinberg, in the sense that allowing cluster- 
ing schemes that output a nested family of partitions in the form of a proximity dendrogram, instead 
of a fixed partition, removes the obstruction to existence. The unique HC method characterized by 
our theorem turns out to be single linkage hierarchical clustering. 

We stress the fact that our result assumes that outputs of hierarchical methods are proximity 
dendrograms, whereas Kleinberg’s Theorem applies to flat/standard clustering, a situation in which 
the output contains no proximity information between clusters. 

In order to state and prove our results we make use of the well known equivalent representation 
of dendrograms, the output of HC methods, using ultrametrics. This already appears in the book of 
Hartigan and others, see Hartigan (1985), Jain and Dubes (1988, §3.2.3) and references therein. 

In recent years, the theme of studying the properties of metrics with prescribed generalized 
curvature properties has been studied intensively. In particular, the work of Gromov (1987) has 
been seminal, and many interesting results have been proved concerning objects other than metric 
spaces, such as finitely generated groups, depending on these methods. The curvature conditions 
can be formulated in terms of properties of triangles within the metric spaces, and the most extreme 
of these properties is that embodied in ultrametric spaces. A second idea of Gromov’s is to make the 
collection of all metric spaces into its own metric space, and the resulting metric gives a very useful 
and natural way to distinguish between metric spaces (Gromov, 2007). This metric is known as the 
Gromov-Hausdorff distance and its restriction to the subclass of ultrametric spaces is therefore a 
very natural object to study. 


1.1 Stability 


Stability of some kind is clearly a desirable property of clustering methods and, therefore, a point 
of interest is studying whether results obtained by a given clustering algorithm are stable to per- 
turbations in the input data. Since input data are modelled as finite metric spaces, and the output 
of hierarchical methods can be regarded as finite ultrametric spaces, the Gromov-Hausdorff dis- 
tance provides a natural tool for studying variability or perturbation of the inputs and outputs of 
hierarchical clustering methods. 

After observing in §3.6 that average and complete linkage clustering are not stable in the metric 
sense alluded to above, we prove in Proposition 26 that single linkage does enjoy a kind of stability: 


Proposition 2 Let $(X, d_X)$ and $(Y, d_Y)$ be two finite metric spaces and let $(X, u_X)$ and $(Y, u_Y)$ be the
two corresponding outputs (finite ultrametric spaces) yielded by single linkage HC. Then,


$d_{GH}((X, u_X), (Y, u_Y)) \le d_{GH}((X, d_X), (Y, d_Y))$.


Here, $d_{GH}$ stands for the Gromov-Hausdorff distance.

Figure 2: Convergence of dendrograms. We formalize this concept by equivalently representing
dendrograms as ultrametrics and then computing the Gromov-Hausdorff distance between
the resulting metrics. We prove in Theorem 30 that by taking increasingly many i.i.d.
samples from a given probability distribution $\mu$ on a metric space, with probability 1
one recovers a multiscale representation of the support of $\mu$.


This result is very important for the convergence theorems which we prove in the later parts of
the paper. These results describe in a very precise way the fact that for compact metric spaces X,
the results of clustering the finite subsets of X yield a collection of dendrograms which ultimately
converge to the dendrogram for X. In order for this to happen, one needs a metric on ultrametric
spaces as well as control over the behavior of the clustering construction with respect to the
Gromov-Hausdorff distance, which is what Proposition 2 provides. The issue of stability is further
explored in §5.


1.2 Probabilistic Convergence 


Finally, in Theorem 30 we also prove that for random i.i.d. observations $\mathbb{X}_n = \{x_1, \ldots, x_n\}$ with
probability distribution $\mu$ compactly supported in a metric space (X,d), the result $(\mathbb{X}_n, u_{\mathbb{X}_n})$ of
applying single linkage clustering to $(\mathbb{X}_n, d)$ converges almost surely in the Gromov-Hausdorff sense
to an ultrametric space that recovers the multiscale structure of the support of $\mu$, see Figure 2.
This can be interpreted as a refinement of a previous observation (Hartigan, 1985) that SLHC is
insensitive to the distribution of mass of $\mu$ in its support.


1.3 Organization of the Paper 


This paper is organized as follows: §A provides a list of all the notation defined and used throughout
the paper; §2 introduces the terminology and basic concepts that we use in our paper; §3.2 reviews
hierarchical clustering methods in general; §3.3 discusses the representation of dendrograms as
ultrametric spaces and establishes the equivalence of both representations; §3.5 delves into the
issue of constructing a notion of distance between dendrograms which is based on the equivalence
of dendrograms and ultrametrics; and §3.6 comments on issues pertaining to the theoretical properties
of HC methods. In §4 we present our characterization result, Theorem 18, for SL in a spirit similar
to the axiomatic treatment of Kleinberg. We delve into the stability and convergence questions of
SL in §5, where we introduce all the necessary concepts from Metric Geometry. Proposition 26 and
Theorem 28 contain our results for the deterministic case. In §5.3 we prove a probabilistic convergence
result, Theorem 30, that hinges on a general sampling theorem for measure metric spaces,
Theorem 34. Finally, we conclude the paper with a discussion on future directions.

For clarity of exposition, we have chosen to move most of the proofs in this paper to an appendix.
The ones which remain in the main text are intended to provide intuition which would not otherwise
be there.


2. Background and Notation 


A metric space is a pair (X,d) where X is a set and $d : X \times X \to \mathbb{R}^+$ satisfies
1. For all $x, x' \in X$, $d(x',x) = d(x,x') \ge 0$ and $d(x,x') = 0$ if and only if $x = x'$.
2. For all $x, x', x'' \in X$, $d(x,x'') \le d(x,x') + d(x',x'')$.
A metric space (X,u) is an ultrametric space if and only if for all $x, x', x'' \in X$,
$\max(u(x,x'), u(x',x'')) \ge u(x,x'')$. (1)


Ultrametric spaces are therefore metric spaces which satisfy a stronger type of triangle inequal- 
ity. It is interesting to observe that this ultrametric triangle inequality (1) implies that all triangles 
are isosceles! 
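
As a quick numerical illustration, the following Python sketch checks inequality (1) over all triples of a small distance matrix and verifies the isosceles property on one triangle. The function names and the two three-point matrices are illustrative assumptions and do not come from the paper.

```python
def is_ultrametric(u):
    """Check max(u[x][y], u[y][z]) >= u[x][z] for all triples, i.e. inequality (1)."""
    n = len(u)
    return all(
        max(u[x][y], u[y][z]) >= u[x][z]
        for x in range(n) for y in range(n) for z in range(n)
    )


def two_longest_sides_agree(u, x, y, z):
    """In an ultrametric space the two longest sides of any triangle are equal."""
    sides = sorted((u[x][y], u[y][z], u[x][z]))
    return sides[1] == sides[2]


# Hypothetical ultrametric: points 0 and 1 are at distance 1,
# and point 2 is at distance 3 from both of them.
u = [[0, 1, 3],
     [1, 0, 3],
     [3, 3, 0]]

# Euclidean distances between the points 0, 1, 3 of the real line:
# a metric, but not an ultrametric, since max(1, 2) < 3.
d = [[0, 1, 3],
     [1, 0, 2],
     [3, 2, 0]]

print(is_ultrametric(u), two_longest_sides_agree(u, 0, 1, 2))
print(is_ultrametric(d))
```

The brute-force check over all triples is cubic in the number of points, which is adequate for the small examples considered here.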

Notice that by iterating the ultrametric property one obtains that if $x_1, x_2, \ldots, x_k$ is any set of k
points in X, then


$\max(u(x_1,x_2), u(x_2,x_3), \ldots, u(x_{k-1},x_k)) \ge u(x_1,x_k)$.


For a fixed finite set X, we let U(X) denote the collection of all ultrametrics on X. For $n \in \mathbb{N}$ let
$\mathcal{X}_n$ (resp. $\mathcal{U}_n$) denote the collection of all metric spaces (resp. ultrametric spaces) with n points. Let
$\mathcal{X} = \bigsqcup_{n \ge 1} \mathcal{X}_n$ denote the collection of all finite metric spaces and $\mathcal{U} = \bigsqcup_{n \ge 1} \mathcal{U}_n$ all finite ultrametric
spaces. For $(X,d) \in \mathcal{X}$ let


$\mathrm{sep}(X,d) := \min_{x \ne x'} d(x,x')$ and $\mathrm{diam}(X,d) := \max_{x,x'} d(x,x')$


be the separation and the diameter of X, respectively.

We now recall the definition of an equivalence relation. Given a set A, a binary relation is a
subset $S \subset A \times A$. One says that a and a' are related and writes $a \sim a'$ whenever $(a,a') \in S$. S is
called an equivalence relation if and only if for all $a, b, c \in A$, all the following hold true:


• Reflexivity: $a \sim a$.
• Symmetry: if $a \sim b$ then $b \sim a$.
• Transitivity: if $a \sim b$ and $b \sim c$ then $a \sim c$.


The equivalence class of a under $\sim$, denoted [a], is defined as all those a' which are related to
a: $[a] = \{a' \in A \text{ s.t. } a' \sim a\}$. Finally, the quotient space $A \backslash \sim$ is the collection of all equivalence
classes: $A \backslash \sim \; := \{[a], a \in A\}$.

We now construct our first example which will be crucial in our presentation. 


Example 1 (r-equivalence) Given a finite metric space (X,d) and $r \ge 0$ we say that points $x, x' \in X$
are r-equivalent (denoted $x \sim_r x'$) if and only if there exist points $x_0, x_1, \ldots, x_t \in X$ with $x_0 = x$,
$x_t = x'$ and $d(x_i, x_{i+1}) \le r$ for $i = 0, \ldots, t-1$. It is easy to see that $\sim_r$ is indeed an equivalence
relation on X.


Figure 3: Illustration of the equivalence relation $\sim_r$. A finite metric space X is specified by the
points in orange which are endowed with the Euclidean distance. This construction can
be understood as allowing the creation of edges joining two points whenever the distance
between them does not exceed r. Then, two points x and x' in black are deemed r-equivalent
if one can find a sequence of edges on the resulting graph connecting x to
x'. From left to right and top to bottom we show the resulting graph one obtains for 4
increasing values of r. The points x and x' are not r-equivalent when $r = r_1, r_2$ or $r_3$, but
they are $r_4$-equivalent.


This definition embodies the simple idea of partitioning a finite metric space into path connected
components, where the granularity of this partitioning is specified by the parameter $r \ge 0$, see Figure
3.
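
The r-equivalence classes of Example 1 can be computed with a union-find structure, as in the following minimal Python sketch. The function name `r_equivalence_classes`, the choice of union-find, and the four sample points on the real line are assumptions made here for illustration.

```python
def r_equivalence_classes(d, r):
    """Partition the indices 0..n-1 into classes of the relation ~_r:
    i ~_r j iff some chain of hops, each of length <= r, joins i to j."""
    n = len(d)
    parent = list(range(n))

    def find(i):
        # follow parent pointers to the representative, halving paths on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # create an "edge" between every pair of points at distance <= r
    for i in range(n):
        for j in range(i + 1, n):
            if d[i][j] <= r:
                parent[find(i)] = find(j)

    classes = {}
    for i in range(n):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())


# Hypothetical data: four points on the real line with the Euclidean distance.
points = [0.0, 1.0, 2.0, 5.0]
d = [[abs(p - q) for q in points] for p in points]

for r in (0.5, 1.0, 3.0):
    print(r, r_equivalence_classes(d, r))
```

Increasing r can only merge classes, never split them, which is the nesting behaviour that the dendrogram construction relies on.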


1. Indeed, assume that all sides a,b,c of a triangle in a given ultrametric space are different. Then, without loss of
generality a > b > c. But then, a > max(b,c) which violates (1). Hence, there must be at least two equal sides in
every triangle in an ultrametric space.

For a finite set X, and a symmetric function $W : X \times X \to \mathbb{R}^+$ let L(W) denote the maximal
metric on X less than or equal to W (Bridson and Haefliger, 1999), that is,


$L(W)(x,x') = \min \left\{ \sum_{i=0}^{m-1} W(x_i, x_{i+1}) \;:\; x = x_0, \ldots, x_m = x' \right\}$


for $x, x' \in X$.
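
Since L(W) is a minimum of chain sums, it can be computed as an all-pairs shortest path problem. The sketch below uses the Floyd-Warshall recursion; this choice of algorithm, the function name `maximal_metric`, the convention that the diagonal is set to zero, and the sample matrix are assumptions made for illustration.

```python
def maximal_metric(W):
    """Compute L(W)(x, x') = min over chains x = x_0, ..., x_m = x' of
    sum_i W(x_i, x_{i+1}) using the Floyd-Warshall recursion.
    W is a symmetric matrix (list of lists) of non-negative numbers."""
    n = len(W)
    # start from W itself, with zero self-distances (assumed convention)
    L = [[0 if i == j else W[i][j] for j in range(n)] for i in range(n)]
    for k in range(n):            # allow chains passing through point k
        for i in range(n):
            for j in range(n):
                if L[i][k] + L[k][j] < L[i][j]:
                    L[i][j] = L[i][k] + L[k][j]
    return L


# Hypothetical symmetric function violating the triangle inequality:
# W(0, 2) = 5 exceeds W(0, 1) + W(1, 2) = 2.
W = [[0, 1, 5],
     [1, 0, 1],
     [5, 1, 0]]

print(maximal_metric(W))
```

In this example only the entry between points 0 and 2 changes, since the chain through point 1 is shorter than the direct value.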

For a finite set X, we let C(X) denote the collection of all non-empty subsets of X. By P(X)
we denote the set of all partitions of X. For a given partition $\Pi \in P(X)$ we refer to each $B \in \Pi$ as a
block of $\Pi$. For partitions $\Pi, \Pi' \in P(X)$, we say that $\Pi$ is coarser than $\Pi'$, or equivalently that $\Pi'$ is
a refinement of $\Pi$, if for every block $B' \in \Pi'$ there exists a block $B \in \Pi$ s.t. $B' \subset B$.

For $k \in \mathbb{N}$ and r > 0 let $S^{k-1}(r) \subset \mathbb{R}^k$ denote the (k − 1) dimensional sphere with radius r. By
((a)) we will denote a matrix of elements $a_{ij}$.


3. Hierarchical Clustering: Formulation 


In this section we formally define hierarchical clustering methods as maps that assign a dendrogram
to a finite metric space. First, in §3.1 we formalize the standard concept of dendrogram; then, in §3.2
we present a formal treatment of HC methods which emphasizes the need for a formulation that is
insensitive to arbitrary choices such as the labels given to the points in the data set. Finally, in §3.3
we prove that the collection of all dendrograms over a finite set is in a one to one correspondence
with the collection of all ultrametrics on this set. We then redefine HC methods as maps from
the collection of finite metric spaces to the collection of all finite ultrametric spaces. This change
in perspective permits a natural formulation and study of the stability and convergence issues in
later sections of the paper. In particular, in §3.5, we discuss the construction of notions of distance
between dendrograms by appealing to the ultrametric representation. These notions are instrumental
for the arguments in §5.

Finally, in §3.6, we digress on some critiques of the classical HC methods. The situation with
HC methods is seemingly paradoxical in that SL is the one that seems to enjoy the best theoretical
properties while CL and AL, despite exhibiting some undesirable behaviour, are the usual choices
of practitioners.


3.1 Dendrograms 


A dendrogram over a finite set X is defined to be a nested family of partitions, usually represented
graphically as a rooted tree. Dendrograms are meant to represent a hierarchical decomposition
of the underlying set X, such as those that are produced by hierarchical clustering algorithms, and
therefore the nested family of partitions provided must satisfy certain conditions. We formally
describe dendrograms as pairs $(X, \theta)$, where X is a finite set and $\theta : [0, \infty) \to P(X)$. The parameter
of $\theta$ usually represents a certain notion of scale and it is reflected in the height of the different levels,
see Figure 4. We require that $\theta$ satisfies:


1. $\theta(0) = \{\{x_1\}, \ldots, \{x_n\}\}$. This condition means that the initial decomposition of space is the
finest possible: the space itself.


2. There exists $t_0$ s.t. $\theta(t)$ is the single block partition for all $t \ge t_0$. This condition encodes the
fact that for large enough t, the partition of the space becomes trivial.


3. If $r \le s$ then $\theta(r)$ refines $\theta(s)$. This condition ensures that the family of partitions provided
by the dendrogram is indeed nested.


4. For all r there exists $\varepsilon > 0$ s.t. $\theta(r) = \theta(t)$ for $t \in [r, r+\varepsilon]$. (technical condition)


Figure 4: A graphical representation of a dendrogram over the set $X = \{x_1, x_2, x_3, x_4\}$. Let $\theta$ denote
the dendrogram. Notice for example that $\theta(a) = \{\{x_1\}, \{x_2\}, \{x_3\}, \{x_4\}\}$; $\theta(b) =
\{\{x_1, x_2\}, \{x_3\}, \{x_4\}\}$; $\theta(c) = \{\{x_1, x_2\}, \{x_3, x_4\}\}$; and $\theta(t) = \{x_1, x_2, x_3, x_4\}$ for any $t \ge r_3$.
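
The conditions above can be checked mechanically once a dendrogram is stored as a finite list of breakpoints. The following Python sketch assumes such a representation, a list of (height, partition) pairs read as a right-continuous step function; the representation, the function names and the sample data with heights 1, 2, 3 are illustrative assumptions.

```python
def refines(P, Q):
    """True if every block of partition P lies inside some block of partition Q."""
    return all(any(B <= C for C in Q) for B in P)


def is_dendrogram(levels, X):
    """Check conditions (1)-(3) for a step function theta given as a list of
    (t_k, partition_k) pairs, meaning theta(t) = partition_k for t_k <= t < t_{k+1}.
    Condition (4) holds automatically for this right-continuous representation."""
    heights = [t for t, _ in levels]
    parts = [[frozenset(B) for B in P] for _, P in levels]
    # every level must be a partition of X: non-empty, disjoint blocks covering X
    partitions_ok = all(
        all(B for B in P) and sorted(x for B in P for x in B) == sorted(X)
        for P in parts
    )
    increasing = heights[0] == 0 and all(s < t for s, t in zip(heights, heights[1:]))
    cond1 = all(len(B) == 1 for B in parts[0])                    # singletons at t = 0
    cond2 = len(parts[-1]) == 1                                   # eventually one block
    cond3 = all(refines(P, Q) for P, Q in zip(parts, parts[1:]))  # nestedness
    return partitions_ok and increasing and cond1 and cond2 and cond3


X = [1, 2, 3, 4]
good = [(0, [{1}, {2}, {3}, {4}]),
        (1, [{1, 2}, {3}, {4}]),
        (2, [{1, 2}, {3, 4}]),
        (3, [{1, 2, 3, 4}])]
bad = [(0, [{1}, {2}, {3}, {4}]),
       (1, [{1, 2}, {3}, {4}]),
       (2, [{1, 3}, {2, 4}]),   # the block {1, 2} is split again: not nested
       (3, [{1, 2, 3, 4}])]

print(is_dendrogram(good, X), is_dendrogram(bad, X))
```

The list `good` mirrors the shape of the four-point dendrogram described in the text, with hypothetical merge heights.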


Let D(X) denote the collection of all possible dendrograms over a given finite set X. When
understood from context, we will omit the first component of a dendrogram $(X, \theta) \in D(X)$ and refer
to $\theta$ as a dendrogram over X.


Remark 3 (About our definition of dendrogram) Our definition coincides with what Jain and
Dubes call proximity dendrograms in Jain and Dubes (1988, §3.2). We stress that we view the
parameter t in our definition as part of the information about the hierarchical clustering. Jain and
Dubes also discuss a simpler version of dendrograms, which they call threshold dendrograms, which
retain merely the order in which successive partitions are created. These of course can be viewed as
functions from $\mathbb{N}$ into P(X) satisfying the constraints (1), (2) and (3) above, instead of having the
domain $[0, \infty)$.
It seems that proximity dendrograms are the type of dendrograms that are most often employed
by practitioners and statisticians, see for example the dendrograms provided by the statistical software
R² and by Matlab’s statistics toolbox,³ whereas threshold dendrograms are more popular in
the Machine Learning and Computer Science communities.


Usually, Hierarchical Clustering methods are defined as those maps that to each finite metric 
space (X,d) assign a dendrogram over X. 
Using the definitions above we now construct our first example. 


Example 2 For each finite metric space (X,d) let $(X, \theta^*) \in D(X)$ be given by $\theta^*(r) = X \backslash \sim_r$. In
other words, for each $r \ge 0$, $\theta^*(r)$ returns the partition of X into $\sim_r$-equivalence classes. Recall
(Example 1) that two points x and x' are $\sim_r$-equivalent if and only if one can find a sequence of
points $x_0, x_1, \ldots, x_t$ s.t. the first of them is x and the last one is x' and all the hops are at most
r: $\max_i d(x_i, x_{i+1}) \le r$. We will see below that this definition coincides with single linkage
hierarchical clustering. See Figure 5 for an illustration of this concept.


Figure 5: For the same finite metric space X of Example 1 and the value $r = r_2$, $X \backslash \sim_{r_2} =
\{\{x_1, x_2, x_3, x_4, x_5, x_6\}, \{x_7, x_8\}, \{x_9\}, \{x_{10}, x_{11}\}\}$, that is, $\sim_{r_2}$ splits X into four path
connected components.


2. Available at http://www.r-project.org/.
3. Available at http://www.mathworks.com/products/statistics/.

In order to build up intuition about our definitions, we prove that $(X, \theta^*)$ is indeed a dendrogram.
Since X is a metric space, $x \sim_0 x'$ if and only if x = x'. Thus condition (1) above is satisfied.
Clearly, for $t \ge \mathrm{diam}(X,d)$, $x \sim_t x'$ for all x,x', and thus condition (2) holds. Fix $0 \le r \le s$ and
let B be a block of $\theta^*(r)$ and let $x, x' \in B$. Then, by definition of $\theta^*(r)$,
$x \sim_r x'$. But it follows from the definition of $\sim_r$ that if $x \sim_r x'$, then $x \sim_s x'$ for all $s \ge r$. Hence, x,x'
are in the same block of $\theta^*(s)$ and condition (3) holds. Condition (4) holds since clearly $\theta^*$ is right
continuous, has finitely many discontinuity points, and is piecewise constant.


We now need to discuss a formal description of agglomerative HC methods. 


3.2 A General Description of Agglomerative Hierarchical Clustering Methods 


In this section we give a description of agglomerative HC methods that is suitable for our theoretical
analyses. Standard algorithmic descriptions of HC methods typically make the assumption that in
the merging process there are only two points at minimal linkage value of each other. For example,
the formulation of Lance and Williams (1967) does not specifically explain how to deal with the
case when more than two points are candidates for merging. In practice one could argue that if at
a certain stage, say, three points are at minimal linkage value of each other, then one could proceed
to merge them two at a time, according to some predefined rule that depends on the indices of the
points.

Whereas this tie breaking strategy seems reasonable from a computational point of view, it
invariably leads to dendrograms that depend on the ordering of the points. This is no doubt an
undesirable feature: it means, for example, that the results of the clustering methods
depend on the order in which the data samples were obtained. Single linkage HC is exempt from
this problem however, because at each stage only minimal distances are taken into
account. In contrast, complete and average linkage will produce results that do not behave well
under reordering of the points.

The problems arising from ad hoc tie breaking are often not even mentioned in books on clus- 
tering. A notable exception is the book Jain and Dubes (1988), especially Section §3.2.6, where the 
reader can find a careful exposition of these issues. 

Below, we formulate HC methods in a way that is independent of these extraneous features.
In order to do so, we need to have some kind of invariance in the formulation. More precisely,
let $(X, d_X)$ be the input metric space, where we assume that $X = \{1, \ldots, n\}$ consists of exactly n
points. Write $(X, \theta_X)$ for the output dendrogram of a given HC method applied to $(X, d_X)$. Let $\pi$ be
a permutation of the indices $\{1, 2, \ldots, n\}$, and $(Y, d_Y)$ be the metric space with points $\{1, \ldots, n\}$ and
permuted metric: $d_Y(i,j) := d_X(\pi_i, \pi_j)$ for all $i, j \in \{1, \ldots, n\}$; further, denote by $(Y, \theta_Y)$ the output
dendrogram of the same HC method applied on $(Y, d_Y)$. Then, we require that for all permutations
$\pi$, the result of computing the dendrogram first and then permuting the result is the same as the
result of first permuting the input distance matrix and then computing the output dendrogram:


$\pi \circ \theta_X(t) = \theta_Y(t)$, for all $t \ge 0$. (2)

Formally, the action of a permutation $\pi$ over a partition (such as $\theta_X(t)$ above) must be understood
in the following sense: if $P = \{B_1, \ldots, B_r\}$ is a partition of $\{1, 2, \ldots, n\}$, then $\pi \circ P$ is the partition
with blocks $\{\pi \circ B_i, 1 \le i \le r\}$, where in turn $\pi \circ B_i$ consists of all those indices $\pi_j$ for $j \in B_i$.

We elaborate on this in the next example. We first recall the usual definition of CLHC, and then
construct a simple metric space consisting of five points where this usual formulation of CL fails to
exhibit invariance to permutations.


3.2.1 THE STANDARD FORMULATION OF COMPLETE LINKAGE HC 


We assume (X, ((d))) is a given finite metric space. In this example, we use the formulas for CL
but the structure of the iterative procedure in this example is common to all HC methods (Jain and
Dubes, 1988, Chapter 3). Let $\theta$ be the dendrogram to be constructed in this example.


1. Set $X_0 = X$ and $D_0 = ((d))$ and set $\theta(0)$ to be the partition of X into singletons.


2. Search the matrix $D_0$ for the smallest non-zero value, that is, find $\delta_0 = \mathrm{sep}(X_0)$, and find
all pairs of points $\{(x_{i_1}, x_{j_1}), (x_{i_2}, x_{j_2}), \ldots, (x_{i_k}, x_{j_k})\}$ at distance $\delta_0$ from each other, that is,
$d(x_{i_\alpha}, x_{j_\alpha}) = \delta_0$ for all $\alpha = 1, 2, \ldots, k$, where one orders the indices s.t. $i_1 < i_2 < \ldots < i_k$.


3. Merge the first pair of elements in that list, $(x_{i_1}, x_{j_1})$, into a single group. The procedure now
removes $(x_{i_1}, x_{j_1})$ from the initial set of points and adds a point c to represent the cluster
formed by both: define $X_1 = (X_0 \setminus \{x_{i_1}, x_{j_1}\}) \cup \{c\}$. Define the dissimilarity matrix $D_1$ on
$X_1 \times X_1$ by $D_1(a,b) = D_0(a,b)$ for all $a, b \ne c$ and $D_1(a,c) = D_1(c,a) = \max(D_0(x_{i_1}, a), D_0(x_{j_1}, a))$
(this step is the only one that depends on the choice corresponding to CL). Finally, set


$\theta(\delta_0) = \{x_{i_1}, x_{j_1}\} \cup \bigcup_{i \ne i_1, j_1} \{x_i\}$.


4. The construction of the dendrogram $\theta$ is completed by repeating the previous steps until all
points have been merged into a single cluster.


Example 3 (about the standard formulation of complete linkage) The crux of the problem lies 
in step 3 of the procedure outlined above. The choice to merge just the first pair of points in the list 
causes the procedure to not behave well under relabeling of the points in the sense of (2). 

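An explicit example is the following: consider the metric space ({1,2,3,4,5}, ((d))) with five
points and distance matrix (rows and columns indexed by the points 1, ..., 5)


$((d)) = \begin{pmatrix} 0 & 1 & 2 & 5 & 5 \\ 1 & 0 & 3 & 6 & 6 \\ 2 & 3 & 0 & 3 & 3 \\ 5 & 6 & 3 & 0 & 4 \\ 5 & 6 & 3 & 4 & 0 \end{pmatrix}.$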

This metric space arises from considering the graph metric on the graph depicted in Figure 6. 
Under CLHC (as defined in §3.2.1), and under the action of all possible permutations of the labels 
of its 5 points, this metric space produces 3 different non-equivalent dendrograms, see Figure 7. 
This is an undesirable feature, as discussed at length in Jain and Dubes (1988, Chapter 3). 
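
The dependence on labels can be observed by brute force. The Python sketch below implements a pairwise, first-tie-wins complete linkage and runs it on every relabeling of a five-point space, translating each result back to the original labels before comparing. Several ingredients are assumptions made for illustration: the matrix `D` is our reading of the distance matrix of this example (points 1, ..., 5 stored as indices 0, ..., 4), the merged cluster is assumed to take the list position of its first member, and all function names are hypothetical.

```python
from itertools import permutations


def standard_complete_linkage(D):
    """Pairwise complete linkage: at each step merge the FIRST pair (in list
    order) attaining the minimal linkage value. Returns the list of merges
    as (height, merged cluster) pairs."""
    clusters = [frozenset([i]) for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                h = max(D[a][b] for a in clusters[p] for b in clusters[q])
                if best is None or h < best[0]:   # strict '<': first minimal pair wins
                    best = (h, p, q)
        h, p, q = best
        merged = clusters[p] | clusters[q]
        clusters[p] = merged    # assumed convention: merged cluster keeps slot p
        del clusters[q]
        merges.append((h, merged))
    return merges


def dendrograms_under_relabeling(D):
    """Run the procedure on every relabeling of the points and collect the
    distinct outcomes, expressed in the original labels."""
    n = len(D)
    seen = set()
    for pi in permutations(range(n)):
        D_pi = [[D[pi[i]][pi[j]] for j in range(n)] for i in range(n)]
        merges = standard_complete_linkage(D_pi)
        # new label i stands for original point pi[i]
        seen.add(frozenset((h, frozenset(pi[i] for i in c)) for h, c in merges))
    return seen


D = [[0, 1, 2, 5, 5],
     [1, 0, 3, 6, 6],
     [2, 3, 0, 3, 3],
     [5, 6, 3, 0, 4],
     [5, 6, 3, 4, 0]]

outcomes = dendrograms_under_relabeling(D)
print(len(outcomes))
```

Under these assumptions the differences all stem from the three-way tie at linkage value 3 that arises once the two closest points have been merged; which of the tied pairs is merged first depends only on the labeling.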


We now re-define general HC methods in a way that they satisfy (2).

Figure 6: A finite metric space that under permutations leads to different outputs of the usual CL
HC algorithm, see text for details. The metric is defined by the graph distance on the
weighted graph shown.


3.2.2 THE PERMUTATION INVARIANT FORMULATION 


Here we consider the family of Agglomerative Hierarchical clustering techniques (Jain and Dubes,
1988, Chapter 3). We define these by the recursive procedure described next. The main difference
with §3.2.1 is that in Step 3 we will allow more than just two points to merge into the same cluster;
it could also happen, for example, that four points A, B, C, D merge into two different clusters {A, B}
and {C, D} at the same time.

Let the finite metric space (X,d) be given where $X = \{x_1, \ldots, x_n\}$ and let L denote a family of
linkage functions on X:


$L := \{\ell : C(X) \times C(X) \to \mathbb{R}^+\}$


with the property that all $\ell \in L$ are bounded non-negative functions. These functions assign a
non-negative value to each pair of non-empty subsets of X, and provide a certain measure of distance
between two clusters. Let $B, B' \in C(X)$, then, some possible standard choices for $\ell$ are:


• Single linkage: $\ell^{SL}(B,B') = \min_{x \in B} \min_{x' \in B'} d(x,x')$;

• Complete linkage: $\ell^{CL}(B,B') = \max_{x \in B} \max_{x' \in B'} d(x,x')$;

• Average linkage: $\ell^{AL}(B,B') = \frac{\sum_{x \in B} \sum_{x' \in B'} d(x,x')}{\#B \cdot \#B'}$; and

• Hausdorff linkage: $\ell^{HL}(B,B') = d_H(B,B')$.⁴

4. The Hausdorff distance is defined in Definition 21.

The permutation invariant formulation is as follows:


1. Fix $\ell \in L$. For each R > 0 consider the equivalence relation $\sim_{\ell,R}$ on blocks of a partition
$\Pi \in P(X)$, given by $B \sim_{\ell,R} B'$ if and only if there is a sequence of blocks $B = B_1, \ldots, B_s = B'$
in $\Pi$ with $\ell(B_k, B_{k+1}) \le R$ for $k = 1, \ldots, s-1$.


2. Consider the sequences $R_1, R_2, \ldots \in [0, \infty)$ and $\Theta_1, \Theta_2, \ldots \in P(X)$ given by $\Theta_1 := \{x_1, \ldots, x_n\}$,
and recursively for $i \ge 1$ by $\Theta_{i+1} = \Theta_i / \sim_{\ell,R_i}$ where


$R_i := \min\{\ell(B, B');\ B, B' \in \Theta_i,\ B \ne B'\}$.


Note that this process necessarily ends in finitely many steps. This construction reflects the
fact that at step i one agglomerates those clusters at distance $\le R_i$ from each other (as measured
by the linkage function $\ell$). More than two clusters could be merged at any given step.


3. Finally, we define $\theta^\ell : [0, \infty) \to P(X)$ by $r \mapsto \theta^\ell(r) := \Theta_{i(r)}$ where $i(r) := \max\{i \mid R_i \le r\}$.
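
The three steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names, the union-find bookkeeping, and the output format (a list of (height, partition) breakpoints, read as "the partition in force from that height on") are assumptions, and only the single, complete and average linkage functions are included.

```python
def single(D, A, B):
    return min(D[a][b] for a in A for b in B)


def complete(D, A, B):
    return max(D[a][b] for a in A for b in B)


def average(D, A, B):
    return sum(D[a][b] for a in A for b in B) / (len(A) * len(B))


def invariant_hc(D, linkage):
    """Permutation invariant agglomerative HC. Returns a list of pairs
    (R, partition obtained after merging all blocks chained at linkage <= R),
    starting with (0, singletons)."""
    blocks = [frozenset([i]) for i in range(len(D))]
    levels = [(0, set(blocks))]
    while len(blocks) > 1:
        # R: smallest linkage value between two distinct current blocks
        R = min(linkage(D, A, B) for k, A in enumerate(blocks) for B in blocks[k + 1:])
        parent = list(range(len(blocks)))

        def find(i):
            while parent[i] != i:
                i = parent[i]
            return i

        # blocks are equivalent if a chain of blocks with linkage <= R joins them
        for a in range(len(blocks)):
            for b in range(a + 1, len(blocks)):
                if linkage(D, blocks[a], blocks[b]) <= R:
                    parent[find(a)] = find(b)
        groups = {}
        for a, A in enumerate(blocks):
            groups.setdefault(find(a), set()).update(A)
        blocks = [frozenset(g) for g in groups.values()]
        levels.append((R, set(blocks)))
    return levels


def partition_at(levels, r):
    """Partition in force at scale r >= 0: the last breakpoint with height <= r."""
    return [P for R, P in levels if R <= r][-1]


# Three points on a line at unit spacing.
L3 = [[0, 1, 2],
      [1, 0, 1],
      [2, 1, 0]]

for name, ell in (("SL", single), ("AL", average), ("CL", complete)):
    print(name, invariant_hc(L3, ell))
```

Because all tied blocks are merged simultaneously, no ordering of the points enters the computation, so relabeling the input can only relabel the output.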


Remark 4 (About our definition of HC methods) Note that, unlike the usual definition of agglomerative
hierarchical clustering §3.2.1 (Jain and Dubes, 1988, §3.2), at each step of the inductive
definition we allow for more than two clusters to be merged. Of course, the standard formulation
can be recovered if one assumes that at each step i of the algorithm, there exist only two
blocks B and B' in $\Theta_i$ s.t. $R_i = \ell(B, B')$. Then, at each step, only two blocks will be merged.


Example 4 Note for example that for the five point metric space in Example 3, the result of applying
CL (according to the permutation invariant formulation) is the dendrogram in Figure 8 (a). It also
follows, for example, that when applied to the metric space
$L_3 := \left(\{a,b,c\}, \begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 1 \\ 2 & 1 & 0 \end{pmatrix}\right)$, which can
be represented by three points on a line with unit spacing between consecutive points, SL, AL and CL
all yield the same dendrogram, which is shown in Figure 8 (b).


Figure 8: (a) shows the result of applying the permutation invariant formulation of CL to the five 
point metric space of Example 3 (see also Figure 6). (b) shows the dendrogram that one 
obtains as output of (the permutation invariant formulation of) SL, AL and CL applied to 
the metric space $L_3$.


Proposition 5 We have the following properties of the construction above:
• For i = 1, 2, ..., $\Theta_{i+1}$ is coarser than $\Theta_i$ and
• $R_{i+1} \ge R_i$.
• $\theta^\ell$ is a dendrogram over X.


Proof The only non trivial claim is that $R_{i+1} \ge R_i$, which can be proved by induction on i. ∎


Remark 6 From this point forward, all references to SL, AL, and CL clustering will be to the 
permutation invariant formulation, in which more than two clusters can be merged at a given Step. 


The following result is clear, and we omit its proof. 


Proposition 7 The above construction of hierarchical clustering algorithms (including SL, AL, and 
CL) yields algorithms which are permutation invariant. 


A simplification for SL HC. In the particular case of SL, there is an alternative formulation that uses
the equivalence relation introduced in Example 1 and its associated dendrogram (Example 2). The
proof of the following Proposition is deferred to the appendix.


Proposition 8 Let $(X,d)$ be a finite metric space and $\theta^{\mathrm{SL}}$ be the dendrogram over $X$ obtained by
the single linkage agglomerative procedure described above, and let $\theta^{*}$ be the dendrogram over $X$
constructed in Example 2. Then, $\theta^{\mathrm{SL}}(r) = \theta^{*}(r)$ for all $r \geq 0$.


3.3 Dendrograms as Ultrametric Spaces 


The representation of dendrograms as ultrametrics is well known: it appears in the book by
Jardine and Sibson (1971), it has already been used in the work of Hartigan (1985), and it is touched
upon in the classical reference of Jain and Dubes (1988, §3.2.3).

We now present the main ideas regarding this change in perspective which we will adopt for 
all subsequent considerations. The formulation of the output of hierarchical clustering algorithms 
as ultrametric spaces is powerful when one is proving stability results, as well as results about the 
approximation of the dendrograms of metric spaces by their finite subspaces. This is so because 
of the fact that once a dendrogram is regarded as a metric space, the Gromov-Hausdorff metric 
provides a very natural notion of distance on the output, in which the right kind of stability results 
are easily formulated. We state these theorems in §5. 

The main result in this section is that dendrograms and ultrametrics are equivalent. 


Theorem 9 Given a finite set $X$, there is a bijection $\Psi : \mathcal{D}(X) \to \mathcal{U}(X)$ between the collection $\mathcal{D}(X)$
of all dendrograms over $X$ and the collection $\mathcal{U}(X)$ of all ultrametrics over $X$ such that for any
dendrogram $\theta \in \mathcal{D}(X)$ the ultrametric $\Psi(\theta)$ over $X$ generates the same hierarchical decomposition
as $\theta$, that is,

$(\ast)$ for each $r \geq 0$, $x, x' \in B \in \theta(r) \iff \Psi(\theta)(x,x') \leq r$.


Furthermore, this bijection is given by

$$\Psi(\theta)(x,x') = \min\{r \geq 0 \mid x, x' \text{ belong to the same block of } \theta(r)\}.$$

In order to establish the above theorem, we first construct certain natural mappings from $\mathcal{D}(X)$
to $\mathcal{U}(X)$ and from $\mathcal{U}(X)$ to $\mathcal{D}(X)$, and we then prove they are inverses of each other and satisfy $(\ast)$.


3.3.1 FROM DENDROGRAMS TO ULTRAMETRICS 


Let $X$ be a finite set and $\theta : [0,\infty) \to \mathcal{P}(X)$ a dendrogram over $X$. Consider the symmetric map
$u_\theta : X \times X \to \mathbb{R}^+$ given by

$$u_\theta(x,x') := \min\{r \geq 0 \mid x, x' \text{ belong to the same block of } \theta(r)\}. \tag{3}$$

See Figure 9 for an illustration of this definition. Note that condition (4) in the definition of
dendrograms guarantees that $u_\theta$ is well defined. It is easy to see that $u_\theta$ defines an ultrametric on $X$:


Lemma 10 Let $X$ be a finite set and $(X,\theta) \in \mathcal{D}(X)$. Then $u_\theta : X \times X \to \mathbb{R}^+$ defined in (3) is an
ultrametric.


Figure 9: A graphical representation of a dendrogram $\theta$ over $X = \{x_1,x_2,x_3,x_4\}$ and the ultrametric
$u_\theta$. Notice for example, that according to (3), $u_\theta(x_1,x_2) = r_1$ since $r_1$ is the first value
of the (scale) parameter for which $x_1$ and $x_2$ are merged into the same cluster. Similarly,
since $x_1$ and $x_3$ are merged into the same cluster for the first time when the parameter
equals $r_3$, then $u_\theta(x_1,x_3) = r_3$.


3.3.2 FROM ULTRAMETRICS TO DENDROGRAMS 
Conversely, given an ultrametric $u : X \times X \to \mathbb{R}^+$, its associated dendrogram

$$\theta^u : [0,\infty) \to \mathcal{P}(X)$$

can be obtained as follows: for each $r \geq 0$ let $\theta^u(r)$ be the collection of equivalence classes of
$X$ under the relation $x \sim x'$ if and only if $u(x,x') \leq r$. That this defines an equivalence relation
follows immediately from the fact that $u$ is an ultrametric. Indeed, assume that $x \sim x'$ and
$x' \sim x''$ for some $r \geq 0$. Then, $u(x,x') \leq r$ and $u(x',x'') \leq r$. Now, by the ultrametric property,
$\max(u(x,x'), u(x',x'')) \geq u(x,x'')$ and hence $u(x,x'') \leq r$ as well. We conclude that $x \sim x''$, thus
establishing the transitivity of $\sim$.


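Example 5 Consider the ultrametric $u$ on $X = \{x_1,x_2,\ldots,x_6\}$ given by

$$((u)) = \begin{array}{c|cccccc} & x_1 & x_2 & x_3 & x_4 & x_5 & x_6 \\ \hline x_1 & 0 & 2 & 2 & 5 & 6 & 6 \\ x_2 & 2 & 0 & 2 & 5 & 6 & 6 \\ x_3 & 2 & 2 & 0 & 5 & 6 & 6 \\ x_4 & 5 & 5 & 5 & 0 & 6 & 6 \\ x_5 & 6 & 6 & 6 & 6 & 0 & 4 \\ x_6 & 6 & 6 & 6 & 6 & 4 & 0 \end{array}$$

Then, for example $\theta^u(1) = \{\{x_1\},\{x_2\},\{x_3\},\{x_4\},\{x_5\},\{x_6\}\}$, $\theta^u(3) = \{\{x_1,x_2,x_3\},\{x_4\},\{x_5\},\{x_6\}\}$,
$\theta^u(4.5) = \{\{x_1,x_2,x_3\},\{x_4\},\{x_5,x_6\}\}$, $\theta^u(5.5) = \{\{x_1,x_2,x_3,x_4\},\{x_5,x_6\}\}$ and
$\theta^u(7) = \{x_1,x_2,x_3,x_4,x_5,x_6\}$. A graphical representation of the dendrogram $\theta^u$ is given in Figure
10.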


Figure 10: A graphical representation of the dendrogram $\theta^u$ of Example 5, see the text for details.
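The passage from an ultrametric to its dendrogram can be illustrated with a short sketch. The code below is only an illustration under stated assumptions: the function name `blocks_at_scale` is hypothetical, points $x_1,\ldots,x_6$ are stored as indices 0 to 5, and the matrix is the one of Example 5 with the row of $x_4$ filled in by symmetry.

```python
def blocks_at_scale(u, r):
    """Partition {0, ..., n-1} into the classes of x ~ x' iff u[x][x'] <= r.

    Assumes u is an ultrametric (symmetric, zero diagonal, strong triangle
    inequality). Under that assumption the relation is transitive, so it is
    enough to compare each point with one representative per block.
    """
    blocks = []
    for x in range(len(u)):
        for block in blocks:
            if u[x][block[0]] <= r:
                block.append(x)
                break
        else:
            blocks.append([x])
    return blocks


# Ultrametric of Example 5 (assumed reading of the matrix, see lead-in).
U = [
    [0, 2, 2, 5, 6, 6],
    [2, 0, 2, 5, 6, 6],
    [2, 2, 0, 5, 6, 6],
    [5, 5, 5, 0, 6, 6],
    [6, 6, 6, 6, 0, 4],
    [6, 6, 6, 6, 4, 0],
]

for r in (1, 3, 4.5, 5.5, 7):
    print(r, blocks_at_scale(U, r))
```

Evaluating the function at increasing values of the scale parameter traces out the nested partitions that the dendrogram encodes.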


3.3.3 THE CONCLUSION OF THE PROOF OF THEOREM 9. 


It is easy to check that (1) given any dendrogram $\theta$ on $X$, $\theta^{u_\theta} = \theta$ and (2) given any ultrametric $u$ on
$X$, $u_{\theta^u} = u$. Now, let $\Psi : \mathcal{D}(X) \to \mathcal{U}(X)$ be defined by $\theta \mapsto \Psi(\theta) := u_\theta$. By construction we see that
$\Psi : \mathcal{D}(X) \to \mathcal{U}(X)$ is a bijection and that $\Psi^{-1}$ is given by $u \mapsto \theta^u$. From (3) we see that $\Psi$ satisfies
$(\ast)$. Hence, we obtain Theorem 9.


From now on, whenever given a dendrogram $\theta_X$ over a set $X$, we will be using the notation
$\Psi(\theta_X)$ for the ultrametric associated to $X$ given by Theorem 9. In a similar manner, given an
ultrametric $u$ on $X$, $\Psi^{-1}(u)$ will denote the dendrogram over $X$ given by Theorem 9.





3.4 Reformulation of Hierarchical Clustering using Ultrametrics 


In the sequel, appealing to Theorem 9 which states the equivalence between ultrametrics and 
dendrograms, we represent dendrograms as ultrametric spaces. Then, any hierarchical clustering 
method can be regarded as a map from finite metric spaces into finite ultrametric spaces. This 
motivates the following definition: 


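Definition 11 A hierarchical clustering method is defined to be a map

$$\mathfrak{T} : \mathcal{X} \to \mathcal{U} \quad \text{s.t.} \quad \mathcal{X}_n \ni (X,d) \mapsto (X,u) \in \mathcal{U}_n, \quad n \in \mathbb{N}.$$

Example 6 For a given finite metric space $(X,d)$ consider the HC method $\mathfrak{T}^{\mathrm{SL}}$ given by $\mathfrak{T}^{\mathrm{SL}}(X,d) =
(X, \Psi(\theta^{\mathrm{SL}}))$, where $\theta^{\mathrm{SL}}$ is the single linkage dendrogram over $X$ defined in §3.2. Similarly, we define
$\mathfrak{T}^{\mathrm{CL}}$ and $\mathfrak{T}^{\mathrm{AL}}$.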


Example 7 (maximal sub-dominant ultrametric) There is a canonical construction: Let $\mathfrak{T}^{*} :
\mathcal{X} \to \mathcal{U}$ be given by $(X,d) \mapsto (X,u^{*})$ where

$$u^{*}(x,x') := \min\left\{ \max_{i=0,\ldots,k-1} d(x_i, x_{i+1}) \;:\; x = x_0, \ldots, x_k = x' \right\}. \tag{4}$$

We remark that the minimum above is taken over $k \in \mathbb{N}$ and all $(k+1)$-tuples of points $x_0,x_1,\ldots,x_k$ in
$X$ s.t. $x_0 = x$ and $x_k = x'$. Notice that for all $x,x' \in X$, $u^{*}(x,x') \leq d(x,x')$.

This construction is sometimes known as the maximal sub-dominant ultrametric and it has the
property that if $u \leq d$ is any other ultrametric on $X$, then $u \leq u^{*}$. The Lemma below proves that
this canonical construction is equivalent to the ultrametric induced by the equivalence relation in
Example 1.


Lemma 12 For $(X,d) \in \mathcal{X}$ write $\mathfrak{T}^{*}(X,d) = (X,u^{*})$ and let $(X,\theta^{*}) \in \mathcal{D}(X)$ be the dendrogram
arising from the construction in Example 2. Then, $u^{*} = \Psi(\theta^{*})$.


Remark 13 Notice that another way of stating the Lemma above is that $x \sim_r x'$ if and only if
$u^{*}(x,x') \leq r$.


It turns out that $\mathfrak{T}^{*}$ yields exactly single linkage clustering as defined in §3.2.


Corollary 14 One has that $\mathfrak{T}^{\mathrm{SL}} = \mathfrak{T}^{*}$.
Equivalently, for any finite metric space $X$, the single linkage dendrogram $\theta^{\mathrm{SL}}$ on $X$ agrees with
$\Psi^{-1}(u^{*})$.
Proof The proof follows easily from Proposition 8 and Lemma 12. ∎
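Formula (4) can be evaluated directly. The sketch below is one possible way to do so, not a procedure taken from the paper: it assumes the metric is given as a symmetric list of lists with zero diagonal, and it uses a Floyd-Warshall style relaxation in which path lengths are combined with `max` instead of `+`. The function name `subdominant_ultrametric` is hypothetical.

```python
def subdominant_ultrametric(d):
    """Return u* where u*[i][j] is the smallest possible largest hop
    over all chains of points joining i to j, as in formula (4)."""
    n = len(d)
    u = [row[:] for row in d]  # work on a copy, leave the input untouched
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Cost of the best chain i -> k followed by the best chain k -> j.
                via_k = max(u[i][k], u[k][j])
                if via_k < u[i][j]:
                    u[i][j] = via_k
    return u


# Three equally spaced points on a line (distances |i - j|).
L3 = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
print(subdominant_ultrametric(L3))
```

For three equally spaced points the chain through the middle point has largest hop 1, so every off-diagonal entry of the result equals 1.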


We emphasize that, as it follows from Corollary 14, $\mathfrak{T}^{*}$ produces ultrametric outputs which
are exactly those corresponding to SLHC. We will use this fact strongly in the sequel.





3.4.1 INTERPRETATION OF THE ULTRAMETRIC 


For a HC method $\mathfrak{T}$ and $(X,d) \in \mathcal{X}$, let $\mathfrak{T}(X,d) = (X,u)$. The intuition that arises from (3) is that
for two points $x,x' \in X$, $u(x,x')$ measures the minimal effort method $\mathfrak{T}$ makes in order to join $x$ to $x'$
into the same cluster.

We note in particular that a desirable property of a HC algorithm should be that upon shrinking
some of the distances in the input metric space, the corresponding “efforts” also decrease. This
property is exactly verified by $\mathfrak{T}^{*}$. Indeed, let $X$ be a finite set and $d_1$ and $d_2$ two metrics on $X$ s.t.
$d_1 \geq d_2$. Write $\mathfrak{T}^{*}(X,d_1) = (X,u_1^{*})$ and $\mathfrak{T}^{*}(X,d_2) = (X,u_2^{*})$. Then, it follows immediately from
Equation (4) that $u_1^{*} \geq u_2^{*}$ (compare with Kleinberg’s consistency property, pp. 1425).

Observe that CL and AL HC fail to satisfy this property. An example is provided in Figure 19. 

We see in Theorem 18 that a condition of this type, together with two more natural normalizing 
conditions, completely characterizes SLHC. 


3.5 Comparing results of Hierarchical Clustering Methods 


One of the goals of this paper is to study the stability of clustering methods to perturbations in the 
input metric space. In order to do so one needs to define certain suitable notions of distance between 
dendrograms. We choose to do this by appealing to the ultrametric representation of dendrograms, 
which provides a natural way of defining a distance between hierarchical clusterings. We now delve 
into the construction. 

Consider first the simple case of two different dendrograms $(X,\alpha)$ and $(X,\beta)$ over the same
fixed finite set $X$. In this case, as a tentative measure of dissimilarity between the dendrograms we
look at the maximal difference between the associated ultrametrics given by Theorem 9, $u_\alpha = \Psi(\alpha)$
and $u_\beta = \Psi(\beta)$: $\max_{x,x' \in X} |u_\alpha(x,x') - u_\beta(x,x')|$. There is a natural interpretation of the condition that
$\max_{x,x' \in X} |u_\alpha(x,x') - u_\beta(x,x')| \leq \varepsilon$: if we look at the graphical representation of the dendrograms $\alpha$
and $\beta$, then the transition horizontal lines in Figure 11 have to occur within $\varepsilon$ of each other.⁵ This is
easy to see by recalling that by (3),

$$u_\alpha(x,x') = \min\{r \geq 0 \mid x, x' \text{ belong to the same block of } \alpha(r)\}$$

and

$$u_\beta(x,x') = \min\{r \geq 0 \mid x, x' \text{ belong to the same block of } \beta(r)\}.$$


For the example in Figure 11, we then obtain that $\max_i |r_i - r_i'| \leq \varepsilon$, which is not surprising since
$r_2 = u_\alpha(x_1,x_2)$, $r_2' = u_\beta(x_1,x_2)$, etc.


5. These lines represent values of the scale parameter for which there is a merging of blocks of the partitions encoded 
by the dendrograms. 
Figure 11: Two different dendrograms $(X,\alpha)$ and $(X,\beta)$ over the same underlying set $X =
\{x_1,x_2,x_3,x_4\}$. The condition that $\|u_\alpha - u_\beta\|_{L^\infty(X \times X)} \leq \varepsilon$ is equivalent to the horizontal
dotted lines corresponding to $r_i$ and $r_i'$ ($i = 1,2,3$) being within $\varepsilon$ of each other.


Now, in a slightly more general situation we may be faced with the task of comparing two different
dendrograms $\alpha$ and $\beta$ without knowing (or caring about) the exact labels of the points. In this
case, a natural solution is to look at the minimum of the maximum difference of the corresponding
ultrametrics under all possible permutations, namely:

$$\min_{\pi \in P_n} \max_{x,x' \in X} |u_\alpha(x,x') - u_\beta(\pi(x),\pi(x'))|, \tag{5}$$

where $n$ is the cardinality of $X$ and $P_n$ is the collection of all permutations of $n$ elements.
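For very small sets, expression (5) can be evaluated by exhaustive search. The following sketch is only meant to make the definition concrete: the function name `permutation_distance` is hypothetical, ultrametrics are assumed to be given as square lists of lists over the index set $\{0,\ldots,n-1\}$, and the cost grows like $n!$, so it is not a practical algorithm.

```python
from itertools import permutations


def permutation_distance(u_alpha, u_beta):
    """Brute-force evaluation of (5): minimize over all relabelings pi the
    largest entrywise gap between u_alpha and the relabeled u_beta."""
    n = len(u_alpha)
    if len(u_beta) != n:
        raise ValueError("both ultrametrics must be defined on n points")
    best = float("inf")
    for pi in permutations(range(n)):
        worst = max(
            abs(u_alpha[i][j] - u_beta[pi[i]][pi[j]])
            for i in range(n)
            for j in range(n)
        )
        best = min(best, worst)
    return best


u_a = [[0, 1, 3], [1, 0, 3], [3, 3, 0]]
u_a_relabeled = [[0, 3, 3], [3, 0, 1], [3, 1, 0]]  # same dendrogram, other labels
u_b = [[0, 2, 3], [2, 0, 3], [3, 3, 0]]            # first merge moved from 1 to 2

print(permutation_distance(u_a, u_a_relabeled))
print(permutation_distance(u_a, u_b))
```

The first comparison shows that relabeling alone costs nothing, while the second picks up the shift of the first merge height.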

The most general case arises when we do not know whether the dendrograms come from the 
same underlying set or not. This situation may arise, for example, when comparing the results of 
clustering two different samples, of possibly different sizes, coming from the same data set. One 
may want to be able to compare two such clusterings as a way to ascertain whether the sample size 
is sufficient for capturing the structure of the underlying data set. 

Assume then that we are given $(X_1,\alpha)$ and $(X_2,\beta)$, two different dendrograms, defined possibly
over two different sets $X_1$ and $X_2$ of different cardinality. This potential difference in cardinality in
the two sets forces us to consider transformations other than mere permutations. A natural solution,
which can be interpreted as a relaxation of the permutation based distance (5) discussed above, is
to consider maps $f : X_1 \to X_2$ and $g : X_2 \to X_1$ and look at their distortions:

$$\mathrm{dis}(f) := \max_{x,x' \in X_1} |u_\alpha(x,x') - u_\beta(f(x),f(x'))|,$$

$$\mathrm{dis}(g) := \max_{x,x' \in X_2} |u_\alpha(g(x),g(x')) - u_\beta(x,x')|.$$

The next natural step would be to optimize over the choice of $f$ and $g$, for example by minimizing
the maximum of the two distortions:

$$\min_{f,g} \max\big(\mathrm{dis}(f), \mathrm{dis}(g)\big).$$

This construction is depicted in Figure 12. Roughly speaking, this idea leads to the Gromov-Hausdorff
distance. The difference lies in the fact that in the standard definition of the Gromov-Hausdorff
distance, one also considers a term that measures the degree to which $f$ and $g$ are inverses
of each other. More precisely, given the maps $f$ and $g$, this term, called the joint distortion of $f$
and $g$, is given by

$$\mathrm{dis}(f,g) := \max_{x \in X_1,\, y \in X_2} |u_\alpha(x,g(y)) - u_\beta(y,f(x))|.$$

One defines the Gromov-Hausdorff distance between $(X_1,u_\alpha)$ and $(X_2,u_\beta)$ by⁶

$$d_{GH}(X_1,X_2) := \frac{1}{2} \min_{f,g} \max\big(\mathrm{dis}(f), \mathrm{dis}(g), \mathrm{dis}(f,g)\big). \tag{6}$$
We now see exactly how the inclusion of the new term enforces $f$ and $g$ to be approximate
inverses of each other. Assume that for some $\varepsilon > 0$, $d_{GH}(X_1,X_2) < \varepsilon$; then, in particular, there
exist maps $f$ and $g$ such that $|u_\alpha(x,g(y)) - u_\beta(y,f(x))| \leq 2\varepsilon$ for all $x \in X_1$ and $y \in X_2$. Choosing
$y = f(x)$, in particular, we obtain that $u_\alpha(x,g(f(x))) \leq 2\varepsilon$ for all $x \in X_1$. Similarly one obtains that
$u_\beta(y,f(g(y))) \leq 2\varepsilon$ for all $y \in X_2$. These two inequalities measure the degree to which $f \circ g$ and
$g \circ f$ differ from the identities, and thus measure the degree to which $f$ and $g$ fail to be inverses of
each other. This is a useful feature when one considers convergence issues such as we do in §5.
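Definition (6) can be checked by hand on tiny examples, and the sketch below automates that check. It is an illustration under assumptions, not an algorithm from the paper: the function name `gh_distance` is hypothetical, both spaces are given as square distance matrices over index sets, and every pair of maps $(f,g)$ is enumerated, so the running time is exponential in the number of points.

```python
from itertools import product


def gh_distance(ux, uy):
    """Brute-force evaluation of (6) for two small finite metric spaces."""
    n, m = len(ux), len(uy)
    best = float("inf")
    for f in product(range(m), repeat=n):          # f[x] is the image of x in Y
        dis_f = max(abs(ux[a][b] - uy[f[a]][f[b]])
                    for a in range(n) for b in range(n))
        for g in product(range(n), repeat=m):      # g[y] is the image of y in X
            dis_g = max(abs(ux[g[a]][g[b]] - uy[a][b])
                        for a in range(m) for b in range(m))
            # Joint distortion: how far f and g are from being inverses.
            dis_fg = max(abs(ux[x][g[y]] - uy[y][f[x]])
                         for x in range(n) for y in range(m))
            best = min(best, max(dis_f, dis_g, dis_fg))
    return best / 2


two_points = [[0, 2], [2, 0]]
one_point = [[0]]
print(gh_distance(two_points, one_point))
print(gh_distance(two_points, two_points))
```

Comparing a two-point space of diameter 2 with a single point gives half the diameter, and comparing a space with itself gives zero.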


3.5.1 INTERPRETATION OF THE GROMOV-HAUSDORFF DISTANCE IN TERMS OF 
DENDROGRAMS 


Assume that $d_{GH}((X_1,u_\alpha),(X_2,u_\beta)) \leq \frac{\eta}{2}$ for some $\eta \geq 0$. Then there exist maps $f : X_1 \to X_2$ and
$g : X_2 \to X_1$ such that the following conditions hold (see Figure 13):


• If $x,x'$ fall in the same block of $\alpha(t)$ then $f(x), f(x')$ belong to the same block of $\beta(t')$ for all
$t' \geq t + \eta$.


• If $y,y'$ fall in the same block of $\beta(t)$ then $g(y), g(y')$ belong to the same block of $\alpha(t')$ for all
$t' \geq t + \eta$.


For the next section we do not need to make use of the full generality in these considerations: 
there we only compare dendrograms defined over the same underlying set. A more detailed use and 
additional material about the Gromov-Hausdorff ideas is given in §5. 

We finish this section with a precise result regarding the stability of dendrograms arising from 
SLHC. 
Figure 12: In this example, two different dendrograms, $(X,\alpha)$ and $(Y,\beta)$, are given. The (straight)
arrows pointing from left to right show a map $f : X \to Y$, and the (curved) arrows
pointing from right to left show the map $g : Y \to X$. With simple explicit computations
one sees that these choices of maps $f$ and $g$ incur distortions $\mathrm{dis}(f) = \mathrm{dis}(f,g) =
\max(r_3, |r_1 - r_1'|, |r_2 - r_2'|)$ and $\mathrm{dis}(g) = \max(|r_1 - r_1'|, |r_2 - r_2'|)$, respectively. Hence,
we see that $d_{GH}((X,\Psi(\alpha)),(Y,\Psi(\beta))) \leq \frac{1}{2}\max(r_3, |r_1 - r_1'|, |r_2 - r_2'|)$.


The following Lemma deals with the situation when we have a fixed finite set $P$ and two different
metrics on $P$, and we compute the result of applying $\mathfrak{T}^{*}$ to each of these metrics. This lemma is a
particular case of our main stability result, Proposition 26 in §5. In the interest of clarity, we prove
it here to provide some intuition about the techniques.


Lemma 15 Let $P$ be a fixed finite set and let $d_1, d_2$ be two metrics on $P$. Write $\mathfrak{T}^{*}(P,d_i) = (P,u_i)$,
$i = 1,2$. Then,

$$\max_{p,q \in P} |u_1(p,q) - u_2(p,q)| \leq \max_{p,q \in P} |d_1(p,q) - d_2(p,q)|.$$

Proof Let $\eta = \max_{p,q \in P} |d_1(p,q) - d_2(p,q)|$. Let $p_0,\ldots,p_k \in P$ be s.t. $p_0 = p$, $p_k = q$ and
$\max_i d_1(p_i,p_{i+1}) = u_1(p,q)$. Then, by definition of $u_2$ (which is the minimum over all chains of
the maximal hop measured with metric $d_2$) and the fact that $d_2 \leq d_1 + \eta$:

$$u_2(p,q) \leq \max_i d_2(p_i,p_{i+1}) \leq \max_i \big(\eta + d_1(p_i,p_{i+1})\big) = \eta + u_1(p,q).$$

Similarly, $u_1(p,q) \leq \eta + u_2(p,q)$, and hence $|u_1(p,q) - u_2(p,q)| \leq \eta$. The claim follows since
$p,q \in P$ are arbitrary. ∎
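The inequality of Lemma 15 can be checked numerically on a concrete pair of metrics. The sketch below is only an illustration: the helper names are hypothetical, the two metrics come from a point set on the real line and a slightly jittered copy of it (an arbitrary choice that guarantees both are metrics), and the single linkage ultrametric is computed through formula (4) with a min-max relaxation.

```python
import random


def subdominant_ultrametric(d):
    """u*[i][j] = smallest possible largest hop over chains from i to j."""
    n = len(d)
    u = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                u[i][j] = min(u[i][j], max(u[i][k], u[k][j]))
    return u


def sup_diff(a, b):
    """Largest entrywise absolute difference between two square matrices."""
    n = len(a)
    return max(abs(a[i][j] - b[i][j]) for i in range(n) for j in range(n))


def line_metric(points):
    """Distance matrix of points on the real line."""
    return [[abs(p - q) for q in points] for p in points]


rng = random.Random(0)                      # fixed seed, arbitrary choice
xs = [rng.uniform(0, 10) for _ in range(8)]
ys = [x + rng.uniform(-0.1, 0.1) for x in xs]

d1, d2 = line_metric(xs), line_metric(ys)
u1, u2 = subdominant_ultrametric(d1), subdominant_ultrametric(d2)

print("perturbation of the metric:     ", sup_diff(d1, d2))
print("perturbation of the ultrametric:", sup_diff(u1, u2))
```

Whatever the jitter, the second printed number should not exceed the first, which is exactly the content of the Lemma.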


6. The factor $\frac{1}{2}$ is of course immaterial but kept here for coherence with the standard definition.


Figure 13: These are the same dendrograms as in Figure 12. Let rı = d r = l, r3 = L, ri = a and 


r, = 3. For the maps f and g s.t. f(x1) = f(x2) =y1, f (x3) = ya, f(x4) = y3, 801) = 
$x_1$, $g(y_2) = x_3$ and $g(y_3) = x_4$, using the formulas computed in Figure 12 we see that
$\mathrm{dis}(f) = \mathrm{dis}(g) = \mathrm{dis}(f,g) = 4$ and hence $d_{GH}((X,\Psi(\alpha)),(Y,\Psi(\beta))) \leq 2$. Now notice
for instance that $x_3$ and $x_4$ fall in the same block of $\alpha(r_2) = \alpha(1)$ and that $y_2 = f(x_3)$
and $y_3 = f(x_4)$ fall in the same block of $\beta(t')$ for all $t' \geq r_2 + 2 \cdot 2 = 1 + 4 = 5 = r_2'$.


3.6 Some Remarks about Hierarchical Clustering Methods 


Practitioners of clustering often prefer AL and CL to SL because it is perceived that the former two
methods tend to produce clusters which are more coherent conceptually, and which are in a
non-technical sense viewed as more compact. In fact, SL exhibits the so-called chaining effect which
makes it more likely to produce clusterings which separate items which conceptually should be
together. We view these observations as evidence for the idea that good clustering schemes need to
take some notion of density into account, rather than straightforward geometric information alone.
One can loosely argue that given the actual definition of the linkage functions used by AL and
CL, these two methods do enjoy some sort of sensitivity to density. Unfortunately, AL and CL are
unstable, and in particular, discontinuous in a very precise sense (see Remark 16 below), whereas
SL enjoys all the nice theoretical properties that the other two methods lack.

In this section we review this seemingly paradoxical situation. 

For each $n \in \mathbb{N}$ let $L_n$ be a metric space with $n$ points $P = \{p_1,\ldots,p_n\}$ and metric $d_{L_n}(p_i,p_j) =
|i - j|$, $i,j \in \{1,\ldots,n\}$. Similarly, let $\Delta_n$ be the metric space with the same underlying set and
metric $d_{\Delta_n}(p_i,p_j) = 1$, $i,j \in \{1,\ldots,n\}$, $i \neq j$. Clearly, the metric space $L_n$ is isometric to points
equally spaced on a line in Euclidean space (s.t. two adjacent points are at distance 1 from
each other) whereas $\Delta_n$ is isometric to the $(n-1)$-unit-simplex as a subset of $\mathbb{R}^{n-1}$.

Clearly, the outputs of Single Linkage HC applied to both $L_n$ and $\Delta_n$ coincide for all $n \in \mathbb{N}$:

$$\mathfrak{T}^{*}(P,d_{L_n}) = \mathfrak{T}^{*}(P,d_{\Delta_n}) = (P,((\gamma_{ij}))), \tag{7}$$

where $\gamma_{ij} = 0$ if $i = j$ and $\gamma_{ij} = 1$ if $i \neq j$, for all $n \in \mathbb{N}$, see Figure 14.


Figure 14: The metric spaces $L_n$ and $\Delta_n$ both have $n$ points. Single linkage HC applied to either of
them yields the dendrogram in the center.


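By appealing to the Euclidean realizations of $L_n$ and $\Delta_n$, one can define perturbed versions
of these two metric spaces. Indeed, fix $\varepsilon > 0$ and let $\{a_1,\ldots,a_n\} \subset [0,\varepsilon/2]$ and $\{b_1,\ldots,b_n\} \subset
S^{n-1}(\varepsilon/2)$. Define $L_n^{\varepsilon}$ to be the metric space with underlying set $P$ and metric $d_{L_n^{\varepsilon}}(p_i,p_j) = |i - j +
a_i - a_j|$. Similarly, define $\Delta_n^{\varepsilon}$ to be the metric space with underlying set $P$ and metric $d_{\Delta_n^{\varepsilon}}(p_i,p_j) =
\|s_i - s_j + b_i - b_j\|$, where $s_1,\ldots,s_n$ denote the vertices of the Euclidean realization of $\Delta_n$.
Notice that by construction,

$$\max_{i,j} |d_{L_n}(p_i,p_j) - d_{L_n^{\varepsilon}}(p_i,p_j)| \leq \varepsilon \tag{8}$$

and

$$\max_{i,j} |d_{\Delta_n}(p_i,p_j) - d_{\Delta_n^{\varepsilon}}(p_i,p_j)| \leq \varepsilon. \tag{9}$$

We thus say that the spaces $(P,d_{L_n^{\varepsilon}})$ and $(P,d_{\Delta_n^{\varepsilon}})$ are perturbed versions of $(P,d_{L_n})$ and $(P,d_{\Delta_n})$,
respectively.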


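Remark 16 (About a critique to SL) Single linkage is generally regarded as a poor choice in
practical applications. The reason for this is the so-called chaining effect observed experimentally,
which is central to the criticism to SL made in Lance and Williams (1967) (see also the discussion
in Wishart, 1969, pp. 296). The following two observations are important:


(O1) It is generally argued that since $(P,d_{L_n^{\varepsilon}})$ corresponds to points in the vicinity of a line, whereas
$(P,d_{\Delta_n^{\varepsilon}})$ corresponds to points in the close vicinity of an $(n-1)$-simplex, then the cluster formed
by points on the latter metric space is more compact or denser than the one formed by the
former, and thus more meaningful.


(O2) The outputs of SL to the spaces $(P,d_{L_n^{\varepsilon}})$ and $(P,d_{\Delta_n^{\varepsilon}})$ are very similar and this similarity is of
order $\varepsilon$.
Indeed, if we write $\mathfrak{T}^{*}(P,d_{L_n^{\varepsilon}}) = (P,u_{L_n^{\varepsilon}})$ and $\mathfrak{T}^{*}(P,d_{\Delta_n^{\varepsilon}}) = (P,u_{\Delta_n^{\varepsilon}})$, then, by the triangle
inequality for the $L^\infty$ norm,

$$\|u_{L_n^{\varepsilon}} - u_{\Delta_n^{\varepsilon}}\|_{L^\infty(P \times P)} \leq \|u_{L_n^{\varepsilon}} - u_{L_n^{0}}\|_{L^\infty(P \times P)} + \|u_{L_n^{0}} - u_{\Delta_n^{0}}\|_{L^\infty(P \times P)} + \|u_{\Delta_n^{0}} - u_{\Delta_n^{\varepsilon}}\|_{L^\infty(P \times P)}. \tag{10}$$

As we pointed out in (7) at the beginning of §3.6,

$$u_{L_n^{0}} = u_{L_n} = ((\gamma)) = u_{\Delta_n} = u_{\Delta_n^{0}},$$

thus, (10) simplifies into:

$$\|u_{L_n^{\varepsilon}} - u_{\Delta_n^{\varepsilon}}\|_{L^\infty(P \times P)} \leq \|u_{L_n^{\varepsilon}} - u_{L_n^{0}}\|_{L^\infty(P \times P)} + \|u_{\Delta_n^{\varepsilon}} - u_{\Delta_n^{0}}\|_{L^\infty(P \times P)} \tag{11}$$

(and by Lemma 15:)

$$\leq \|d_{L_n^{\varepsilon}} - d_{L_n^{0}}\|_{L^\infty(P \times P)} + \|d_{\Delta_n^{\varepsilon}} - d_{\Delta_n^{0}}\|_{L^\infty(P \times P)}.$$

Hence, by (11) and the construction of $d_{L_n^{\varepsilon}}$ and $d_{\Delta_n^{\varepsilon}}$ (Equations (8) and (9)), we conclude that

$$\|u_{L_n^{\varepsilon}} - u_{\Delta_n^{\varepsilon}}\|_{L^\infty(P \times P)} \leq 2\varepsilon.$$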
This means that for any small perturbations of $L_n$ and $\Delta_n$, the outputs of SL on these perturbations
are at a small distance from each other, as we claimed.

When put together, observations (O1) and (O2) suggest that SL is unable to pick denser
associations of data, such as cliques, over sparser ones, such as linear structures. This feature is
undesirable in practical applications where one would often like to regard clusters as modes
of an underlying distribution (Wishart, 1969; Hartigan, 1981).

It is then the case that in practical applications, CL and especially AL are preferred over SL.
These two methods have the property that they indeed somehow favor the association of compact
subsets of points. For CL this can be explained easily using the concept of maximal clique (maximally
connected sub-graphs of a given graph) (Jain and Dubes, 1988, Section 3.2.1). Let $d_k$ be the
diameter of the cluster created in step $k$ of CL clustering and define a graph $G(k)$ as the graph that
links all data points with a distance of at most $d_k$. Then the clusters after step $k$ are the maximal
cliques of $G(k)$. This observation reinforces the perception that CL yields clusters that are dense as
measured by the presence of cliques. The sensitivity of AL to density has been discussed by Hartigan
(1985, Section 3) and is basically due to the averaging performed in the definition of
its linkage function.

A more principled way of taking density into account, that does not depend on ad hoc construc- 
tions which destroy the stability property, would be to explicitly build the density into the method. 
In Carlsson and Mémoli (2009) we study multiparameter clustering methods, which are similar to 
HC methods but we track connected components in a multiparameter landscape. We also study the 
classification and stability properties of multiparameter clustering methods. 


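Remark 17 (Instability of CL and AL) It turns out that CL and AL, despite not exhibiting the
undesirable feature of the chaining effect, and despite being regarded as more sensitive to density,
are unstable in a precise sense. Consider for example CL and let $n = 3$. In the construction of
$(P,d_{L_3^{\varepsilon}})$ above let $a_1 = a_2 = 0$ and $a_3 = \varepsilon$, then, with rows and columns indexed by $p_1, p_2, p_3$,

$$((d_{L_3})) = \begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 1 \\ 2 & 1 & 0 \end{pmatrix} \quad \text{and} \quad ((d_{L_3^{\varepsilon}})) = \begin{pmatrix} 0 & 1 & 2+\varepsilon \\ 1 & 0 & 1+\varepsilon \\ 2+\varepsilon & 1+\varepsilon & 0 \end{pmatrix}.$$

Write $\mathfrak{T}^{\mathrm{CL}}(P,d_{L_3}) = (P,u_{L_3})$ and $\mathfrak{T}^{\mathrm{CL}}(P,d_{L_3^{\varepsilon}}) = (P,u_{L_3^{\varepsilon}})$. Clearly,

$$((u_{L_3})) = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix} \quad \text{and} \quad ((u_{L_3^{\varepsilon}})) = \begin{pmatrix} 0 & 1 & 2+\varepsilon \\ 1 & 0 & 2+\varepsilon \\ 2+\varepsilon & 2+\varepsilon & 0 \end{pmatrix}.$$

Notice that despite $\max_{i,j} |d_{L_3}(p_i,p_j) - d_{L_3^{\varepsilon}}(p_i,p_j)| = \varepsilon$, $\max_{i,j} |u_{L_3}(p_i,p_j) - u_{L_3^{\varepsilon}}(p_i,p_j)| = 1 + \varepsilon > 1$
for all $\varepsilon > 0$. We thus conclude that CL is not stable under small perturbations of the metric. Note
that in particular, it follows that CL is not continuous. The same construction can be adapted for
AL. See Figure 15.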
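The instability computation above can be reproduced with a small sketch of the permutation invariant agglomerative procedure. This is an illustration based on our reading of the construction in §3.2.2 (at each step, all blocks linked at the current minimal linkage value are merged at once), with hypothetical function names and the concrete value $\varepsilon = 0.25$ chosen only so that the arithmetic is exact in floating point.

```python
def hc_ultrametric(d, linkage):
    """Permutation invariant agglomerative HC, returned as an ultrametric."""
    n = len(d)
    blocks = [[i] for i in range(n)]
    u = [[0] * n for _ in range(n)]
    while len(blocks) > 1:
        m = len(blocks)
        ell = {(a, b): linkage(d, blocks[a], blocks[b])
               for a in range(m) for b in range(a + 1, m)}
        r = min(ell.values())
        # label[a] names the merged group that block a ends up in; joining
        # every pair with linkage <= r yields the transitive closure.
        label = list(range(m))
        for (a, b), value in ell.items():
            if value <= r and label[a] != label[b]:
                old, new = label[b], label[a]
                label = [new if lab == old else lab for lab in label]
        # Points whose blocks are joined at this step are merged at scale r.
        for a in range(m):
            for b in range(a + 1, m):
                if label[a] == label[b]:
                    for x in blocks[a]:
                        for y in blocks[b]:
                            u[x][y] = u[y][x] = r
        groups = {}
        for a in range(m):
            groups.setdefault(label[a], []).extend(blocks[a])
        blocks = list(groups.values())
    return u


def complete_linkage(d, block_a, block_b):
    return max(d[x][y] for x in block_a for y in block_b)


def sup_diff(a, b):
    n = len(a)
    return max(abs(a[i][j] - b[i][j]) for i in range(n) for j in range(n))


eps = 0.25
d_L3 = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
d_L3_eps = [[0, 1, 2 + eps], [1, 0, 1 + eps], [2 + eps, 1 + eps, 0]]

u_L3 = hc_ultrametric(d_L3, complete_linkage)
u_L3_eps = hc_ultrametric(d_L3_eps, complete_linkage)
print(sup_diff(d_L3, d_L3_eps), sup_diff(u_L3, u_L3_eps))
```

The input metrics differ by $\varepsilon$ while the complete linkage outputs differ by $1 + \varepsilon$, in agreement with the matrices displayed in the remark.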


4. A Characterization Theorem for SL Hierarchical Clustering 


In this section we obtain a characterization of SL hierarchical clustering in terms of some simple
axioms. The main axiom, (II) below, says that the clustering scheme has a prescribed behavior
under distance non-increasing maps of metric spaces. The behavior is that the map of metric spaces
should induce a map of clusters, that is, that if two points in the domain space belong to the same
cluster, then so do their images in the clustering of the image metric space. This notion, referred
to as functoriality in the mathematics literature, appears to us to be a very natural one, and it is
closely related to Kleinberg’s consistency property (cf. pp. 1425) for ordinary clustering methods;
see Remark 19 for an interpretation of our axioms.


Figure 15: Complete Linkage is not stable to small perturbations in the metric. On the left we show
two metric spaces that are metrically very similar. To the right of each of them we show
their CL dendrogram outputs. Regardless of $\varepsilon > 0$, the two outputs are always very
dissimilar. We make the notion of similarity between dendrograms precise in §5 by
interpreting dendrograms as ultrametric spaces and by computing the Gromov-Hausdorff
distance between these ultrametric spaces.


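Theorem 18 Let $\mathfrak{T}$ be a hierarchical clustering method s.t.


(I) $\mathfrak{T}\left(\{p,q\}, \begin{pmatrix} 0 & \delta \\ \delta & 0 \end{pmatrix}\right) = \left(\{p,q\}, \begin{pmatrix} 0 & \delta \\ \delta & 0 \end{pmatrix}\right)$ for all $\delta > 0$.


(II) Whenever $X,Y \in \mathcal{X}$ and $\phi : X \to Y$ are such that $d_X(x,x') \geq d_Y(\phi(x),\phi(x'))$ for all $x,x' \in X$,
then

$$u_X(x,x') \geq u_Y(\phi(x),\phi(x'))$$

also holds for all $x,x' \in X$, where $\mathfrak{T}(X,d_X) = (X,u_X)$ and $\mathfrak{T}(Y,d_Y) = (Y,u_Y)$.


(III) For all $(X,d) \in \mathcal{X}$,

$$u(x,x') \geq \mathrm{sep}(X,d) \quad \text{for all } x \neq x' \in X,$$

where $\mathfrak{T}(X,d) = (X,u)$.


Then $\mathfrak{T} = \mathfrak{T}^{*}$, that is, $\mathfrak{T}$ is exactly single linkage hierarchical clustering.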
Remark 19 (Interpretation of the conditions) Let $(X,d) \in \mathcal{X}$ and write $\mathfrak{T}(X,d) = (X,u)$. The
intuition is that $u(x,x')$ measures the effort method $\mathfrak{T}$ makes in order to join $x$ to $x'$ into the same
cluster.


Condition (I) is clear: the two-point metric space contains only one degree of freedom which has
to determine unambiguously the behavior of any clustering method $\mathfrak{T}$. In terms of dendrograms, this
means that the two point metric space $\left(\{A,B\}, \begin{pmatrix} 0 & \delta \\ \delta & 0 \end{pmatrix}\right)$ must be mapped to the dendrogram where
$A$ and $B$ are merged at parameter value $\delta$, see Figure 16.


Condition (II) is crucial and roughly says that whenever one shrinks some distances (even to
zero) to obtain a new (pseudo) metric space, then the corresponding efforts in this new space have
to be smaller than the efforts in the original metric space. This is consistent with the notion that
reducing the distance between two points (without increasing all other distances) makes them more
likely to belong to the same cluster.

Let $\theta_X = \Psi^{-1}(u_X)$ and $\theta_Y = \Psi^{-1}(u_Y)$ be the dendrograms associated to $u_X$ and $u_Y$. In terms of
dendrograms, this means that if two points $x,x' \in X$ are in the same block of $\theta_X(t)$ for some $t > 0$,
then $\phi(x)$ and $\phi(x')$ must be in the same block of $\theta_Y(t)$, see Figure 17.

Condition (III) expresses the fact that in order to join two points $x,x' \in X$, any clustering method
$\mathfrak{T}$ has to make an effort of at least the separation $\mathrm{sep}(X,d)$ of the metric space. In terms of
dendrograms, this means that $\theta_X(t)$ has to equal the partition of $X$ into singletons for all $0 \leq t < \mathrm{sep}(X,d)$.
See Figure 18.


Figure 16: Interpretation of Condition I: For all $\delta > 0$ the two point metric space on the left must
be mapped by $\mathfrak{T}$ into the dendrogram on the right.


Figure 17: Interpretation of Condition II: Assume that $\phi : X \to Y$ is a distance non-increasing map
such that $\phi(x_1) = \phi(x_2) = y_1$, $\phi(x_3) = y_2$ and $\phi(x_4) = y_3$. Then, Condition (II) requires
that if $x,x' \in X$ are merged into the same cluster of $\Psi^{-1}(u_X)$ at parameter value $t$, then
$\phi(x)$ and $\phi(x')$ must merge into the same cluster of $\Psi^{-1}(u_Y)$ for some parameter value $\leq t$.
In the Figure, this translates into the condition that vertical dotted lines corresponding
to mergings of pairs of points in $X$ should happen at parameter values greater than or
equal to the parameter values for which corresponding points in $Y$ (via $\phi$) are merged
into the same cluster. For example, $\phi(x_1)$, $\phi(x_2)$ merge into the same cluster at parameter
value 0. The condition is clearly verified for this pair since by definition of $\phi$, $\phi(x_1) =
\phi(x_2) = y_1$. Take now $x_3$ and $x_4$: clearly the vertical line that shows the parameter value
for which they merge is to the right of the vertical line showing the parameter value for
which $y_2 = \phi(x_3)$ and $y_3 = \phi(x_4)$ merge.


Remark 20 It is interesting to point out why complete linkage and average linkage hierarchical
clustering, as defined in §3.2.2, fail to satisfy the conditions in Theorem 18. It is easy to see that
conditions (I) and (III) are always satisfied by CL and AL.

Consider the metric spaces $X = \{A,B,C\}$ with metric given by the edge lengths $\{4,3,5\}$ and
$Y = \{A',B',C'\}$ with metric given by the edge lengths $\{4,3,2\}$, as given in Figure 19. Obviously, the
map $\phi$ from $X$ to $Y$ with $\phi(A) = A'$, $\phi(B) = B'$ and $\phi(C) = C'$ is s.t.

$$d_Y(\phi(x),\phi(x')) \leq d_X(x,x') \quad \text{for all } x,x' \in \{A,B,C\}.$$

It is easy to check that, with rows and columns indexed by $A, B, C$ and $A', B', C'$ respectively,

$$((u_X)) = \begin{pmatrix} 0 & 5 & 3 \\ 5 & 0 & 5 \\ 3 & 5 & 0 \end{pmatrix} \quad \text{and} \quad ((u_Y)) = \begin{pmatrix} 0 & 2 & 4 \\ 2 & 0 & 4 \\ 4 & 4 & 0 \end{pmatrix}.$$

Note that for example $3 = u_X(A,C) < u_Y(\phi(A),\phi(C)) = u_Y(A',C') = 4$, thus violating property
(II). The same construction yields a counter-example for average linkage.
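The counter-example of Remark 20 can be verified with a few lines of code. The sketch below handles only three-point spaces whose smallest distance is unique, which is all the example needs. The function name is hypothetical, and the assignment of the lengths 4 and 5 to the edges $AB$ and $BC$ is an assumption (Figure 19 is not reproduced here); it is consistent with the ultrametrics stated in the remark.

```python
def complete_linkage_three_points(d):
    """CL ultrametric of a 3-point space with a unique smallest distance:
    the closest pair merges first, and the third point joins at the larger
    of its two distances to that pair."""
    pairs = [(0, 1), (0, 2), (1, 2)]
    i, j = min(pairs, key=lambda p: d[p[0]][p[1]])
    k = 3 - i - j  # index of the remaining point
    top = max(d[i][k], d[j][k])
    u = [[0] * 3 for _ in range(3)]
    u[i][j] = u[j][i] = d[i][j]
    u[i][k] = u[k][i] = u[j][k] = u[k][j] = top
    return u


# Index 0, 1, 2 stands for A, B, C (respectively A', B', C').
d_X = [[0, 4, 3], [4, 0, 5], [3, 5, 0]]   # assumed edge assignment, see lead-in
d_Y = [[0, 2, 3], [2, 0, 4], [3, 4, 0]]

u_X = complete_linkage_three_points(d_X)
u_Y = complete_linkage_three_points(d_Y)

shrinks = all(d_Y[i][j] <= d_X[i][j] for i in range(3) for j in range(3))
violates = any(u_X[i][j] < u_Y[i][j] for i in range(3) for j in range(3))
print(u_X, u_Y, shrinks, violates)
```

The map is distance non-increasing, yet the effort needed to join $A$ and $C$ grows from 3 to 4, which is the violation of condition (II).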


Figure 18: Interpretation of Condition III: The vertical line at parameter value t = sep(X,d) must 
intersect the horizontal lines of the dendrogram before any two points are merged. 


5. Metric Stability and Convergence of $\mathfrak{T}^{*}$


The Proposition and Theorem below assert the metric stability and consistency/convergence of the
method $\mathfrak{T}^{*}$ (i.e., of SLHC, by virtue of Corollary 14). We use the notion of Gromov-Hausdorff
distance between metric spaces (Burago et al., 2001). This notion of distance permits regarding the
collection of all compact metric spaces as a metric space in itself.


This seemingly abstract construction is in fact very useful. Finite metric spaces are by now 
ubiquitous in virtually all areas of data analysis, and the idea of assigning a metric to the collection 
of all of them is in fact quite an old one. For Euclidean metric spaces, for example, the idea of 
constructing a metric was used by Kendall et al. (1999) and Bookstein et al. (1985) in constructing 
a statistical shape theory, motivated by the ideas about form of biological organisms developed by 
D’Arcy Thompson.
Figure 19: An example that shows why complete linkage fails to satisfy condition (II) of Theorem
18.


5.1 The Gromov-Hausdorff Distance and Examples 


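Definition 21 Let $(Z,d_Z)$ be a compact metric space. The Hausdorff distance between any two
compact subsets $A,B$ of $Z$ is defined by

$$d_H^Z(A,B) := \max\left( \max_{a \in A} \min_{b \in B} d_Z(a,b),\ \max_{b \in B} \min_{a \in A} d_Z(a,b) \right).$$

Remark 22 Let $\mathbb{Z} = \{z_1,\ldots,z_n\} \subset Z$. Then, $d_H^Z(\mathbb{Z},Z) \leq \delta$ for some $\delta \geq 0$ if and only if $Z \subset
\bigcup_{i=1}^{n} B(z_i,\delta)$. In other words, $d_H^Z(\mathbb{Z},Z)$ describes the minimal $\delta$ s.t. $\mathbb{Z}$ is a $\delta$-net for $Z$ and therefore
measures how well $\mathbb{Z}$ covers $Z$.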
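For finite point sets the Hausdorff distance of Definition 21 is a direct double loop. The sketch below is a minimal illustration with hypothetical names; the ambient space is a finite grid on the real line standing in for a compact space, and the distance function is passed in as a parameter.

```python
def hausdorff_distance(A, B, dist):
    """Hausdorff distance between two finite, non-empty point sets."""
    a_to_b = max(min(dist(a, b) for b in B) for a in A)
    b_to_a = max(min(dist(a, b) for a in A) for b in B)
    return max(a_to_b, b_to_a)


def abs_dist(x, y):
    return abs(x - y)


# A finite stand-in for the ambient space and a coarse sample of it.
Z = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
sample = [0.0, 2.0, 4.0]

delta = hausdorff_distance(sample, Z, abs_dist)
print(delta)
```

In line with Remark 22, the value returned is the smallest radius for which balls centered at the sample points cover the whole of `Z`.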


The Gromov-Hausdorff distance $d_{GH}(X,Y)$ between compact metric spaces $(X,d_X)$ and $(Y,d_Y)$
was originally defined to be the infimal $\varepsilon > 0$ s.t. there exists a metric $d$ on $X \sqcup Y$ with $d|_{X \times X} = d_X$
and $d|_{Y \times Y} = d_Y$ for which the Hausdorff distance between $X$ and $Y$ (as subsets of $(X \sqcup Y,d)$) is less
than $\varepsilon$ (Gromov, 1987). There is, however, an alternative expression for the GH distance that is
better suited for our purposes which we now recall.


Definition 23 (Correspondence) For sets $A$ and $B$, a subset $R \subset A \times B$ is a correspondence
(between $A$ and $B$) if and only if


• $\forall\, a \in A$, there exists $b \in B$ s.t. $(a,b) \in R$
• $\forall\, b \in B$, there exists $a \in A$ s.t. $(a,b) \in R$


Let $\mathcal{R}(A,B)$ denote the set of all possible correspondences between $A$ and $B$.
We now give several examples to illustrate this definition.


Example 8 Let $A = \{a_1,a_2\}$ and $B = \{b_1,b_2,b_3\}$. In this case, $R_1 = \{(a_1,b_1),(a_2,b_2),(a_1,b_3)\}$ is a
correspondence but $R_2 = \{(a_1,b_1),(a_2,b_2)\}$ is not.
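The two conditions of Definition 23 translate directly into a membership test. The sketch below uses a hypothetical helper `is_correspondence` and represents the points of Example 8 by strings; it is only meant to make the definition concrete.

```python
def is_correspondence(R, A, B):
    """True iff R is a subset of A x B in which every element of A and
    every element of B appears in at least one pair."""
    R = set(R)
    if not all(a in A and b in B for a, b in R):
        return False
    covers_A = all(any((a, b) in R for b in B) for a in A)
    covers_B = all(any((a, b) in R for a in A) for b in B)
    return covers_A and covers_B


A = ["a1", "a2"]
B = ["b1", "b2", "b3"]
R1 = {("a1", "b1"), ("a2", "b2"), ("a1", "b3")}
R2 = {("a1", "b1"), ("a2", "b2")}

print(is_correspondence(R1, A, B))
print(is_correspondence(R2, A, B))
```

The second relation fails because no pair involves `b3`.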


Example 9 Let $A$ and $B$ be finite s.t. $\#A = \#B = n$. In this case, if $\pi$ is any permutation matrix of
size $n$, then $\{(a_i,b_{\pi_i}),\ i = 1,\ldots,n\} \in \mathcal{R}(A,B)$.


Example 10 Let $\varphi: X \to Y$ and $\psi: Y \to X$ be given maps. Then, one can construct a
correspondence out of these maps, call it $R(\varphi,\psi)$, given by

$$\{(x,\varphi(x)),\ x\in X\} \cup \{(\psi(y),y),\ y\in Y\}.$$

For metric spaces $(X,d_X)$ and $(Y,d_Y)$, let $\Gamma_{X,Y}: X\times Y\times X\times Y \to \mathbb{R}^+$ be given by

$$(x,y,x',y') \mapsto |d_X(x,x') - d_Y(y,y')|.$$

Then, by (Burago et al., 2001, Theorem 7.3.25) the Gromov-Hausdorff distance between $X$ and $Y$
is equal to

$$d_{GH}(X,Y) := \frac{1}{2} \inf_{R\in\mathcal{R}(X,Y)}\ \sup_{(x,y),(x',y')\in R} \Gamma_{X,Y}(x,y,x',y'). \qquad (12)$$
It can be seen (it is an easy computation) that in (12) one can restrict the infimum to those
correspondences that arise from maps $\varphi$ and $\psi$ such as those constructed in Example 10. Then, one
recovers expression (6) which we gave in §3.5, namely, that actually

$$d_{GH}(X,Y) = \frac{1}{2} \inf_{\varphi,\psi} \max\big(\mathrm{dis}(\varphi),\ \mathrm{dis}(\psi),\ \mathrm{dis}(\varphi,\psi)\big). \qquad (13)$$
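
For very small finite metric spaces, the correspondence formulation of the Gromov-Hausdorff distance can be evaluated by exhaustive search. The sketch below enumerates every subset of $X \times Y$, keeps those that are correspondences, and returns half of the smallest distortion. It is an illustrative brute force under the assumption that both spaces are given as distance matrices (the name `gh_distance` is hypothetical); its cost is exponential in $\#X \cdot \#Y$, so it is only meant for spaces with a handful of points.

```python
def gh_distance(dX, dY):
    """Brute-force Gromov-Hausdorff distance between two finite metric
    spaces given by distance matrices dX and dY."""
    n, m = len(dX), len(dY)
    pairs = [(i, j) for i in range(n) for j in range(m)]
    best = float("inf")
    # Every non-empty subset of X x Y is encoded by a bitmask over `pairs`.
    for mask in range(1, 1 << len(pairs)):
        R = [pairs[k] for k in range(len(pairs)) if (mask >> k) & 1]
        # Keep only correspondences: both projections must be surjective.
        if {i for i, _ in R} != set(range(n)):
            continue
        if {j for _, j in R} != set(range(m)):
            continue
        distortion = max(abs(dX[i][k] - dY[j][l]) for (i, j) in R for (k, l) in R)
        best = min(best, distortion / 2)
    return best


two_points_near = [[0, 1], [1, 0]]   # two points at distance 1
two_points_far = [[0, 3], [3, 0]]    # two points at distance 3
one_point = [[0]]

gh_pair = gh_distance(two_points_near, two_points_far)
gh_collapse = gh_distance(two_points_near, one_point)
```

In the second computation the only correspondence matches both points to the single point, so the result is half the diameter of the two-point space.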
Remark 24 Expression (13) defines a distance on the set of (isometry classes of) finite metric
spaces (Burago et al., 2001, Theorem 7.3.30). From now on let $\mathcal{G}$ denote the collection of all
(isometry classes of) compact metric spaces. We say that $\{(X_n,d_{X_n})\}_{n\in\mathbb{N}} \subset \mathcal{G}$ Gromov-Hausdorff
converges to $X \in \mathcal{G}$ if and only if $d_{GH}(X_n,X) \to 0$ as $n \uparrow \infty$.


Example 11 Fix $(X,d_X) \in \mathcal{G}$. Consider the sequence $\{(X,\tfrac{1}{n}\, d_X)\}_{n\in\mathbb{N}} \subset \mathcal{G}$. Then, $X_n$ Gromov-
Hausdorff converges to the metric space consisting of a single point.


Remark 25 (Gromov-Hausdorff distance and Hausdorff distance) Let $(X,d_X)$ be a compact metric
space. Then, if $X' \subset X$ is compact and we endow $X'$ with the metric $d_{X'}$ equal to the restriction
of $d_X$, then

$$d_{GH}((X,d_X),(X',d_{X'})) \le d_H^X(X',X).$$

This is easy to see by defining the correspondence $R$ between $X$ and $X'$ given by

$$R = \{(x',x'),\ x'\in X'\} \cup \{(x,x'),\ x\in V(x'),\ x'\in X'\},$$

where $V(x') := \{x\in X,\ d_X(x,x') \le d_X(x,z),\ z\in X'\setminus\{x'\}\}$. Indeed, since then, for all $(x_1,x_1'),(x_2,x_2') \in R$,

$$\frac{1}{2}\,|d_X(x_1,x_2) - d_X(x_1',x_2')| \le \frac{1}{2}\big(d_X(x_1,x_1') + d_X(x_2,x_2')\big) \le \max_{x\in X}\min_{x'\in X'} d_X(x,x') = d_H^X(X,X').$$
Example 12 Consider a finite set $M$ and $d,d' : M\times M \to \mathbb{R}^+$ two metrics on $M$. Then, the GH
distance between $(M,d)$ and $(M,d')$ is bounded above by the $L^\infty$ norm of the difference between $d$
and $d'$:

$$d_{GH}((M,d),(M,d')) \le \frac{1}{2}\,\|d - d'\|_{L^\infty(M\times M)}.$$

To prove this it is enough to consider the correspondence $R \in \mathcal{R}(M,M)$ given by $R = \{(m,m),\ m\in M\}$.

Notice that as an application, for the metric spaces $(P,d_{L_1})$ and $(P,d_{L_2})$ discussed in §3.6, one
has that

$$d_{GH}((P,d_{L_1}),(P,d_{L_2})) \le \frac{\varepsilon}{2} \quad\text{and}\quad d_{GH}((P,d_{\Delta_1}),(P,d_{\Delta_2})) \le \frac{\varepsilon}{2}.$$


5.2 Stability and Convergence Results 


Our first result states that SL HC is stable in the Gromov-Hausdorff sense and it is a generalization
of Lemma 15.


Proposition 26 For any two finite metric spaces $(X,d_X)$ and $(Y,d_Y)$

$$d_{GH}((X,d_X),(Y,d_Y)) \ge d_{GH}(\mathfrak{T}^*(X,d_X),\ \mathfrak{T}^*(Y,d_Y)).$$


Remark 27 This Proposition generalizes Lemma 15. Notice for example that in case X and Y are 
finite, they need not have the same number of points. This feature is important in order to be able 
to make sense of situations such as the one depicted in Figure 2 on page 1428, where one is trying to
capture the connectivity (i.e., clustering) properties of an underlying ‘continuous’ space by taking 
finitely (but increasingly) many samples from this space and applying some form of HC to this finite 
set. Theorem 28 below deals with exactly this situation. See Figure 20. 
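
The stability statement can be checked numerically in the simplest setting, where both metrics live on the same point set and the identity correspondence is used. The sketch below computes the single linkage ultrametric as a minimax path distance (a Floyd-Warshall style recursion, which is one standard way to obtain the maximal sub-dominant ultrametric) and compares the sup-norm change of the output with that of the input. The function names and the integer toy metrics are assumptions made for this example.

```python
def single_linkage_ultrametric(d):
    """u[i][j] = smallest possible value of the largest hop over all chains
    from i to j, i.e. the maximal sub-dominant ultrametric of d."""
    n = len(d)
    u = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                u[i][j] = min(u[i][j], max(u[i][k], u[k][j]))
    return u


def sup_norm_diff(a, b):
    """L-infinity norm of the difference of two square matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Three points on a line at 0, 10, 20, and a perturbation moving the last to 22.
d1 = [[0, 10, 20], [10, 0, 10], [20, 10, 0]]
d2 = [[0, 10, 22], [10, 0, 12], [22, 12, 0]]

u1 = single_linkage_ultrametric(d1)
u2 = single_linkage_ultrametric(d2)

input_change = sup_norm_diff(d1, d2)
output_change = sup_norm_diff(u1, u2)
```

In this example the output ultrametrics differ by no more than the input metrics do, as the proposition predicts for the identity correspondence.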


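Let $(Z,d_Z)$ be a compact metric space. Given a finite index set $A$ and a (finite) collection of
disjoint compact subsets of $Z$, $\{U^{(\alpha)}\}_{\alpha\in A}$, let $W_A : A\times A \to \mathbb{R}^+$ be given by

$$(\alpha,\alpha') \mapsto \min_{z\in U^{(\alpha)},\ z'\in U^{(\alpha')}} d_Z(z,z').$$

A metric space $(A,d_A)$ arises from this construction, where $d_A = \mathcal{L}(W_A)$. We say that $(A,d_A)$ is
the metric space with underlying set $A$ arising from $\{U^{(\alpha)}\}_{\alpha\in A}$. Notice that $\mathrm{sep}(A,d_A)$ equals the
minimal separation between any two sets $U^{(\alpha)}$ and $U^{(\alpha')}$ ($\alpha \ne \alpha'$). More precisely,

$$\mathrm{sep}(A,d_A) = \min_{\alpha,\alpha'\in A,\ \alpha\ne\alpha'}\ \min_{z\in U^{(\alpha)},\ z'\in U^{(\alpha')}} d_Z(z,z').$$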
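
The construction of $(A,d_A)$ can be sketched for finite clusters of points in the plane. The code below computes the matrix $W_A$ of minimal inter-set distances, then takes $d_A = \mathcal{L}(W_A)$ to be the shortest-path metric generated by $W_A$ (the standard way to obtain the maximal metric bounded above by a symmetric weight function), and finally reads off the separation. The cluster coordinates and function names are assumptions chosen for illustration.

```python
import math


def set_distance(U, V):
    """Minimal Euclidean distance between two finite point sets."""
    return min(math.dist(p, q) for p in U for q in V)


def maximal_metric_below(W):
    """Shortest-path metric generated by the symmetric weight matrix W."""
    n = len(W)
    d = [row[:] for row in W]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d


# Hypothetical clusters U^(0), U^(1), U^(2) on the x-axis.
clusters = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(2.0, 0.0), (3.0, 0.0)],
    [(10.0, 0.0)],
]
k = len(clusters)
W = [[0.0 if a == b else set_distance(clusters[a], clusters[b]) for b in range(k)]
     for a in range(k)]
d_A = maximal_metric_below(W)
sep = min(d_A[a][b] for a in range(k) for b in range(k) if a != b)
```

Note that `W[0][2]` is 9 while `d_A[0][2]` is 8: passing through the middle cluster gives a shorter chain, which is why $W_A$ itself need not satisfy the triangle inequality.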
We now state a metric stability and convergence result, see Figure 20. The proof of this result 
is deferred to §B. 


Theorem 28 Assume $(Z,d_Z)$ is a compact metric space. Let $X$ and $X'$ be any two finite subsets of
$Z$ and let $d_X = d_Z|_{X\times X}$ and $d_{X'} = d_Z|_{X'\times X'}$. Write $\mathfrak{T}^*(X,d_X) = (X,u_X)$ and $\mathfrak{T}^*(X',d_{X'}) = (X',u_{X'})$.
Then,
Figure 20: Illustration of Theorem 28. Top: A space $Z$ composed of 3 disjoint path connected
parts, $Z^{(1)}$, $Z^{(2)}$ and $Z^{(3)}$. The black dots are the points in the finite sample $X$. In the
figure, $w_{ij} = W_A(i,j)$, $1 \le i \ne j \le 3$. Bottom Left: The dendrogram representation of
$(X,u_X)$. Bottom Right: The dendrogram representation of $(Z,u_Z)$. Note that $d_Z(z_1,z_2) =
w_{13} + w_{23}$, $d_Z(z_1,z_3) = w_{13}$ and $d_Z(z_2,z_3) = w_{23}$. As $r \to 0$, $(X,u_X) \to (Z,u_Z)$ in the
Gromov-Hausdorff sense, see text for details.


1. (Finite Stability) $d_{GH}((X,u_X),(X',u_{X'})) \le d_H^Z(X,Z) + d_H^Z(X',Z)$.


2. (Approximation bound) Assume in addition that $Z = \bigsqcup_{\alpha\in A} Z^{(\alpha)}$ where $A$ is a finite index set
and $Z^{(\alpha)}$ are compact, disjoint and path-connected sets. Let $(A,d_A)$ be the finite metric space
with underlying set $A$ arising from $\{Z^{(\alpha)}\}_{\alpha\in A}$. Let $\mathfrak{T}^*(A,d_A) = (A,u_A)$. Then, if $d_H^Z(X,Z) <
\mathrm{sep}(A,d_A)/2$,

$$d_{GH}((X,u_X),(A,u_A)) \le d_H^Z(X,Z).$$


3. (Convergence) Under the hypotheses of (2), let $\{X_n\}_{n\in\mathbb{N}}$ be a sequence of finite subsets of $Z$
s.t. $d_H^Z(X_n,Z) \to 0$ as $n \to \infty$, and $d_{X_n}$ be the metric on $X_n$ given by the restriction of $d_Z$ to
$X_n \times X_n$. Then, one has that

$$d_{GH}(\mathfrak{T}^*(X_n,d_{X_n}),(A,u_A)) \to 0 \text{ as } n \to \infty.$$


Remark 29 (Interpretation of the statement) Assertion (1) guarantees that if $X, X'$ are both dense
samples of $Z$, then the results of applying $\mathfrak{T}^*$ to both sets are very close in the Gromov-Hausdorff
sense.

Assertions (2) and (3) identify the limiting behavior of the construction $\mathfrak{T}^*(X_n,d_{X_n})$ as $X_n$
becomes denser and denser in $Z$, see Figure 20.
5.3 A Probabilistic Convergence Result


In this section, we prove a precise result which describes how the dendrograms attached to compact
metric spaces by single linkage clustering can be obtained as the limits of the dendrograms attached
to finite subsets of the metric space. The result is by necessity probabilistic in nature. This kind of
result is of great importance, since we are often interested in infinite metric spaces but typically do
not have access to more than finitely many random samples from the metric space.

Theorem 30 and Corollary 32 below prove that for random i.i.d. observations $X_n = \{x_1,\ldots,x_n\}$
with probability distribution $\mu$ compactly supported in a metric space $(X,d)$, the result $(X_n,u_{X_n})$
of applying single linkage clustering to $(X_n,d)$ converges almost surely in the Gromov-Hausdorff
sense to an ultrametric space that recovers the multiscale structure of the support of $\mu$, see Figure
20. This is a refinement of a previous observation of Hartigan (1985) that SLHC is insensitive to
the distribution of mass of $\mu$ in its support.

The proof of this theorem relies on Theorem 34, a probabilistic covering theorem of independent 
interest. In order to state and prove our theorems we make use of the formalism of metric measure 
spaces. 

A triple $(X,d_X,\mu_X)$, where $(X,d_X)$ is a metric space and $\mu_X$ is a Borel probability measure on
$X$ with compact support will be called an mm-space (short for measure metric space). The support
$\mathrm{supp}[\mu_X]$ of a measure $\mu_X$ on $X$ is the minimal closed set $A$ (w.r.t. inclusion) s.t. $\mu_X(X\setminus A) = 0$.
Measure metric spaces are considered in the work of Gromov and are useful in different contexts,
see (Gromov, 2007, Chapter 3½). For a mm-space $X$ let $f_X : \mathbb{R}^+ \to \mathbb{R}^+$ be defined by

$$r \mapsto \min_{x\in \mathrm{supp}[\mu_X]} \mu_X(B_X(x,r)).$$

Note also that by construction $f_X(\cdot)$ is non-decreasing and $f_X(r) > 0$ for all $r > 0$. Let also $F_X :
\mathbb{N}\times\mathbb{R}^+ \to \mathbb{R}^+$ be defined by $(n,\delta) \mapsto \dfrac{e^{-n f_X(\delta/4)}}{f_X(\delta/4)}$. Note that for fixed $\delta_0 > 0$, (1) $F_X(\cdot,\delta_0)$ is decreasing
in its argument, and (2) $\sum_{n\in\mathbb{N}} F_X(n,\delta_0) < \infty$.


Theorem 30 Let $(Z,d_Z,\mu_Z)$ be a mm-space and write $\mathrm{supp}[\mu_Z] = \bigcup_{\alpha\in A} U^{(\alpha)}$ for a finite index set
$A$ and $U = \{U^{(\alpha)}\}_{\alpha\in A}$ a collection of disjoint, compact, path-connected subsets of $Z$. Let $(A,d_A)$ be
the metric space arising from $U$ and let $\delta_A := \mathrm{sep}(A,d_A)/2$.

For each $n\in\mathbb{N}$, let $Z_n = \{z_1,z_2,\ldots,z_n\}$ be a collection of $n$ independent random variables
(defined on some probability space $\Omega$ with values in $Z$) with distribution $\mu_Z$, and let $d_{Z_n}$ be the
restriction of $d_Z$ to $Z_n\times Z_n$. Then, for $\zeta > 0$ and $n\in\mathbb{N}$,

$$P_{\mu_Z}\big(d_{GH}(\mathfrak{T}^*(Z_n,d_{Z_n}),\ \mathfrak{T}^*(A,d_A)) > \zeta\big) \le F_Z(n,\ \min(\zeta,\delta_A/2)).$$


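Corollary 31 Under the hypotheses of Theorem 30, for any pre-specified probability level $p \in (0,1)$
and tolerance $\zeta > 0$, if

$$n \ge \frac{-\ln(1-p) - \ln f_Z(\delta/4)}{f_Z(\delta/4)},$$

then $P_{\mu_Z}\big(d_{GH}(\mathfrak{T}^*(Z_n,d_{Z_n}),\ \mathfrak{T}^*(A,d_A)) \le \zeta\big) \ge p$, where $\delta := \min(\zeta,\delta_A/2)$.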
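
To make the sample-size condition concrete, the sketch below solves $e^{-n f}/f \le 1-p$ for $n$, where $f$ stands for the value $f_Z(\delta/4)$. In practice $f_Z$ depends on the unknown measure, so the numeric value used here is purely an assumed input, and the function names are hypothetical.

```python
import math


def covering_bound(n, f_quarter):
    """The bound exp(-n * f) / f, with f playing the role of f_Z(delta / 4)."""
    return math.exp(-n * f_quarter) / f_quarter


def sample_size(p, f_quarter):
    """Smallest integer n with (-ln(1 - p) - ln f) / f <= n."""
    return math.ceil((-math.log(1.0 - p) - math.log(f_quarter)) / f_quarter)


# Assumed inputs: confidence level 0.95 and f_Z(delta / 4) = 0.05.
p, f_quarter = 0.95, 0.05
n = sample_size(p, f_quarter)

bound_at_n = covering_bound(n, f_quarter)
bound_just_below = covering_bound(n - 1, f_quarter)
```

With these assumed inputs the bound drops below $1 - p = 0.05$ exactly at the returned `n`, and is still above it at `n - 1`.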
Corollary 32 Under the hypotheses of Theorem 30, $\mathfrak{T}^*(Z_n,d_{Z_n}) \to \mathfrak{T}^*(A,d_A)$ in the Gromov-
Hausdorff sense $\mu_Z$-almost surely.


Proof [Proof of Corollary 32] The proof follows immediately from the expression for $F_Z$ and the
Borel-Cantelli Lemma. ∎


Remark 33 Note that the convergence theorem above implies that in the limit, $\mathfrak{T}^*(X_n,d_{X_n})$ only
retains information about the support of the probability measure but not about the way the mass is
distributed inside the support, compare to Hartigan (1985).


Example 13 ($Z \subset \mathbb{R}^d$) Let $\rho : \mathbb{R}^d \to \mathbb{R}^+$ be a density function with compact support $Z$ and $\mu$ be its
associated probability measure. Then $(\mathbb{R}^d, \|\cdot\|, \mu)$ satisfies the assumptions in the theorem. If one
makes additional smoothness assumptions on $\rho$, in this particular case one can relate $F_Z(n,\zeta)$ to
geometrical properties of the boundary of $\mathrm{supp}[\rho]$.


Example 14 ($Z$ is a Riemannian manifold) In more generality, $Z$ could be a Riemannian manifold
and $\mu$ a probability measure absolutely continuous w.r.t. the Riemannian area measure on $Z$.


6. Discussion 


We have obtained novel characterization, stability and convergence theorems for SL HC. Our theo- 
rems contemplate both the deterministic and the stochastic case. Our characterization theorem can 
be interpreted as a relaxation of Kleinberg’s impossibility result for standard clustering methods 
in that by allowing the output of clustering methods to be hierarchical, one obtains existence and 
uniqueness. 

Our stability results seem to be novel and complement classical observations that CL and AL 
are discontinuous as maps from finite metric spaces into dendrograms. 

Our convergence results also seem to be novel and they refine a previous observation by Hartigan
about the information retained about an underlying density by SL clustering of an i.i.d. collection
of samples from that density. Our setting for the stochastic convergence results is quite general in
that we do not assume the underlying space to be a smooth manifold and we do not assume the 
underlying probability measure to have a density with respect to any reference measure. 

We understand that SL HC is not sensitive to variations in the density (see also Hartigan, 1981). 
In our future work we will be looking at ways of further relaxing the notions of clustering that can 
cope with the problem of detecting “dense” clusters, in the same spirit as Wishart (1969); Stuetzle 
(2003). A follow up paper (Carlsson and Mémoli, 2009) presents a systematic treatment of this with 
a more general framework. 

Some recent works have also addressed the characterization of clustering schemes in the
hierarchical case. The authors of the present paper reported a characterization for proximity dendrograms
(Carlsson and Mémoli, 2008) using the language of category theory. Zadeh and Ben-David (2009)
gave a characterization for threshold dendrograms.⁷ More classical is the work of Jardine and Sibson
(1971) who also ultimately view HC methods as maps from finite metric spaces to finite ultrametric
spaces.


7. Recall that the difference between these two types of dendrograms is that proximity dendrograms retain the linkage 
value at which mergings take place whereas threshold dendrograms only record the order, see Remark 3. 
It is interesting to consider the situation when one requires the map $\varphi$ in our characterization
theorem (Theorem 18) to be 1 to 1 on points. In this case, a much wider class of hierarchical
schemes becomes possible including for example a certain version of clique clustering. The
restriction on the nature of $\varphi$ would be called restriction of functoriality by a mathematician. The
classification question of clustering methods that arises becomes mathematically interesting and we
are currently exploring it (Carlsson and Mémoli, Stanford, 2009; Carlsson and Mémoli, 2008).
Appendix A. Notation


Symbol                                  Meaning

$\mathbb{R}$                            Real numbers.
$\mathbb{R}^d$                          d-dimensional Euclidean space.
$\mathbb{N}$                            Natural numbers.
$((a_{ij}))$                            A square symmetric matrix with elements $a_{ij}$ which are usually distances.
$(X,d)$                                 Metric space $X$ with metric $d$, page 1429.
$(X,u)$                                 Ultrametric space $X$ with ultrametric $u$, page 1429.
$\mathcal{X}$ ($\mathcal{X}_n$)         Collection of all finite (resp. n point) metric spaces, page 1429.
$\mathcal{U}$ ($\mathcal{U}_n$)         Collection of all finite (resp. n point) ultrametric spaces, page 1429.
$C(X)$                                  Collection of all non-empty subsets of the set $X$, page 1429.
$\mathcal{U}(X)$                        Collection of all ultrametrics over the finite set $X$, page 1429.
$\mathcal{P}(X)$                        Collection of all partitions of the finite set $X$, page 1429.
$\Pi$, $B$, $A$                         A partition of a finite set and blocks of that partition, respectively, page 1429.
$\sim$, $[a]$, $A\backslash\!\sim$      An equivalence relation, the equivalence class of a point and the quotient space, page 1429.
$\sim_r$                                An equivalence relation with a parameter $r \ge 0$, page 1429.
$S^{k-1}(r)$                            Sphere of radius $r$ and dimension $k-1$ embedded in $\mathbb{R}^k$, page 1429.
$\mathcal{L}(W)$                        Maximal metric $\le W$, page 1429.
$\theta : [0,\infty) \to \mathcal{P}(X)$   A dendrogram over the finite set $X$, page 1431.
$\mathcal{D}(X)$                        Collection of all dendrograms over the finite set $X$, page 1431.
$\theta^*$                              Dendrogram over the finite set $X$ arising from $\sim_r$, page 1433.
$\ell^{SL}$, $\ell^{CL}$, $\ell^{AL}$   Linkage functions, page 1434.
$\theta^{SL}$, $\theta^{CL}$, $\theta^{AL}$   Dendrograms arising from linkage functions, page 1434.
$\mathfrak{T}$                          A hierarchical clustering method seen as a map $\mathfrak{T} : \mathcal{X} \to \mathcal{U}$, page 1442.
$\mathfrak{T}^*$                        A HC method arising from the maximal sub-dominant ultrametric, page 1442.
$u_\theta$                              An ultrametric obtained from the dendrogram $\theta$, page 1440.
$\theta_u$                              A dendrogram obtained from the ultrametric $u$, page 1441.
$\Psi$                                  A bijective map between $\mathcal{D}(X)$ and $\mathcal{U}(X)$, page 1439.
$\Delta_n$                              Metric space isometric to an n point unit simplex, page 1447.
$L_n$                                   Metric space isometric to n points on a line, page 1447.
$d_H^Z$                                 Hausdorff distance between subsets of the metric space $Z$, page 1455.
$\mathfrak{T}^{SL}$, $\mathfrak{T}^{CL}$, $\mathfrak{T}^{AL}$   Standard linkage based HC methods seen as maps from $\mathcal{X}$ to $\mathcal{U}$, page 1442.
$\mathrm{dis}(f)$, $\mathrm{dis}(f,g)$  Distortion of a map $f$ and joint distortion of a pair of maps $f$ and $g$, page 1445.
$d_{GH}$                                Gromov-Hausdorff distance between metric spaces, pages 1445, 1456.
$\mathrm{sep}(X)$                       Separation of the metric space $X$, page 1429.
$\mathrm{diam}(X)$                      Diameter of the metric space $X$, page 1429.
$\Pi_n$                                 All the $n!$ permutations of elements of the set $\{1,\ldots,n\}$.
$\Gamma_{X,Y}$                          A function used to measure metric distortion, page 1456.
$(X,d,\mu)$                             An mm-space, $(X,d)$ a compact metric space, $\mu$ a Borel probability measure, page 1459.
$\mathrm{supp}[\mu]$                    Support of the probability measure $\mu$, page 1459.
$P_\mu$                                 Probability with respect to the law $\mu$.

Appendix B. Proofs 


Proof [Proof of Proposition 8] The claim follows from the following claim, which we prove by
induction on $i$:

Claim: For all $i \ge 2$, $x,x' \in X$ are s.t. there exists $B \in \Theta_i$ with $x,x' \in B$ if and only if $x \sim_{R_{i-1}} x'$.
Proof [Proof of the Claim] For $i = 2$ the claim is clearly true. Fix $i \ge 2$.

Assume that $x,x' \in X$ and $B \in \Theta_{i+1}$ are such that $x,x' \in B$. If $x,x'$ belong to the same block
of $\Theta_i$ there is nothing to prove. So, assume that $x \in A$ and $x' \in A'$ with $A \ne A'$ and $A, A' \in \Theta_i$.
Then, it must be that there exist blocks $A = A_1, A_2, \ldots, A_s = A'$ of $\Theta_i$ s.t. $\ell^{SL}(A_t,A_{t+1}) \le R_i$ for
$t = 1,\ldots,s-1$. Pick $x_1,y_1 \in A_1$, $x_2,y_2 \in A_2$, ..., $x_s,y_s \in A_s$ s.t. $x_1 = x$ and $y_s = x'$ and $d(y_t,x_{t+1}) =
\ell^{SL}(A_t,A_{t+1}) \le R_i$ for $t = 1,\ldots,s-1$, see Figure 21.





Figure 21: Construction used in the proof of Proposition 8. 


Notice that by the inductive hypothesis we have $x_t \sim_{R_{i-1}} y_t$ for $t = 1,\ldots,s$. It follows that $x \sim_r x'$
for $r = \max(R_i,R_{i-1})$. By Proposition 5, $r = R_i$ and hence $x \sim_{R_i} x'$.

Assume now that $x \sim_{R_i} x'$. If $x,x'$ belong to the same block of $\Theta_i$ there is nothing to prove since
$\Theta_{i+1}$ is coarser than $\Theta_i$ and hence $x,x'$ will also belong to the same block of $\Theta_{i+1}$. Assume then
that $x \in B$ and $x' \in B'$ for $B, B' \in \Theta_i$ with $B \ne B'$. Let $x = x_1,x_2,\ldots,x_s = x'$ be points in $X$ with
$d(x_t,x_{t+1}) \le R_i$ for $t = 1,\ldots,s-1$. Also, for $t = 1,\ldots,s$ let $B_t$ be the block of $\Theta_i$ to which $x_t$
belongs. But then, by construction

$$R_i \ge d(x_t,x_{t+1}) \ge \min_{z\in B_t,\ z'\in B_{t+1}} d(z,z') = \ell^{SL}(B_t,B_{t+1}) \quad\text{for } t = 1,\ldots,s-1,$$

and hence $B_1 \sim_{\ell^{SL},R_i} B_s$. In particular, $B_1 \cup B_s \subset A$ for some $A \in \Theta_{i+1}$ and thus $x,x'$ belong to the
same block in $\Theta_{i+1}$. ∎


Proof [Proof of Lemma 10] Obviously $u_\theta$ is non-negative. Pick $x,x',x'' \in X$ and let $r_1,r_2 \ge 0$ be
s.t. $x,x'$ belong to the same block of $\theta(r_1)$ and $x',x''$ belong to the same block of $\theta(r_2)$. These
numbers clearly exist by condition (2) in the definition of dendrograms. Then, there exists a block
$B$ of $\theta(\max(r_1,r_2))$ s.t. $x,x'' \in B$ and hence $u_\theta(x,x'') \le \max(r_1,r_2)$. The conclusion follows since
$r_1 \ge u_\theta(x,x')$ and $r_2 \ge u_\theta(x',x'')$ are arbitrary.

Now, let $x,x' \in X$ be such that $u_\theta(x,x') = 0$. Then $x,x'$ are in the same block of $\theta(0)$. Condition
(1) in the definition of dendrograms implies that $x = x'$. ∎


Proof [Proof of Lemma 12] Pick $x,x' \in X$ and let $r := u_{\theta^*}(x,x')$. Then, according to (3), there
exist $x_0,x_1,\ldots,x_t \in X$ with $x_0 = x$, $x_t = x'$ and $\max_i d(x_i,x_{i+1}) \le r$. From (4) we conclude that then
$u^*(x,x') \le r$ as well. Assume now that $u^*(x,x') \le r$ and let $x_0,x_1,\ldots,x_t \in X$ be s.t. $x_0 = x$, $x_t = x'$
and $\max_i d(x_i,x_{i+1}) \le r$. Then, $x \sim_r x'$ and hence again by recalling (3), $u_{\theta^*}(x,x') \le r$. This finishes
the proof. ∎


Proof [Proof of Theorem 18] Pick $(X,d) \in \mathcal{X}$. Write $\mathfrak{T}(X,d) = (X,u)$ and $\mathfrak{T}^*(X,d) = (X,u^*)$.
(A) We prove that $u^*(x,x') \ge u(x,x')$ for all $x,x' \in X$. Pick $x,x' \in X$ and let $\delta := u^*(x,x')$. Let
$x = x_0,\ldots,x_n = x'$ be s.t.

$$\max_i d(x_i,x_{i+1}) = u^*(x,x') = \delta.$$

Consider the two point metric space $(Z,e) := \big(\{p,q\}, \big(\begin{smallmatrix} 0 & \delta \\ \delta & 0 \end{smallmatrix}\big)\big)$. Fix $i \in \{0,\ldots,n-1\}$. Consider
$\varphi : \{p,q\} \to X$ given by $p \mapsto x_i$ and $q \mapsto x_{i+1}$. By condition (I) we have $\mathfrak{T}(Z,e) = (Z,e)$. Note that
$\delta = e(p,q) \ge d(\varphi(p),\varphi(q)) = d(x_i,x_{i+1})$ and hence by condition (II),

$$\delta \ge u(x_i,x_{i+1}).$$

Then, since $i$ was arbitrary, we obtain $\delta \ge \max_i u(x_i,x_{i+1})$. Now, since $u$ is an ultrametric on $X$,
we know that $\max_i u(x_i,x_{i+1}) \ge u(x,x')$ and hence $\delta \ge u(x,x')$.
(B) We prove that $u^*(x,x') \le u(x,x')$ for all $x,x' \in X$. Fix $r > 0$. Let $(X_r,d_r)$ be the metric space with
underlying set $X_r$ given by the equivalence classes of $X$ under the relation $x \sim_r x'$. Let $\varphi_r : X \to X_r$
be given by $x \mapsto [x]_r$ where $[x]_r$ denotes the equivalence class of $x$ under $\sim_r$. Let $\bar d_r : X_r \times X_r \to \mathbb{R}^+$
be given by

$$\bar d_r(z,z') = \min_{x\in\varphi_r^{-1}(z),\ x'\in\varphi_r^{-1}(z')} d(x,x')$$

and let $d_r = \mathcal{L}(\bar d_r)$. Note that, by our construction, $\varphi_r$ is such that for all $x,x' \in X$,

$$d(x,x') \ge d_r(\varphi_r(x),\varphi_r(x')).$$

Indeed, assume the contrary. Then for some $x,x' \in X$ one has that $d(x,x') < d_r(\varphi_r(x),\varphi_r(x'))$.
But, from the definition of $d_r$ it follows that $d(x,x') < d_r(\varphi_r(x),\varphi_r(x')) \le \bar d_r(\varphi_r(x),\varphi_r(x')) =
\min\{d(\tilde x,\tilde x'),\ \text{s.t. } \tilde x \sim_r x;\ \tilde x' \sim_r x'\}$. This is a contradiction since $x \sim_r x$ and $x' \sim_r x'$.

Write $\mathfrak{T}(X_r,d_r) = (X_r,u_r)$. Then, by condition (II),

$$u(x,x') \ge u_r(\varphi_r(x),\varphi_r(x')) \qquad (14)$$

for all $x,x' \in X$. Note that

$$\mathrm{sep}(X_r,d_r) > r. \qquad (15)$$

Indeed, for otherwise, there would be two points $x,x' \in X$ with $[x]_r \ne [x']_r$ and $r \ge d(x,x') \ge
u^*(x,x')$. But this gives a contradiction by Remark 13.

Claim: $u^*(x,x') > r$ implies that $u_r(\varphi_r(x),\varphi_r(x')) > r$.

Assuming the claim, let $x,x' \in X$ be s.t. $u^*(x,x') > r$, then by Equation (14),

$$u(x,x') \ge u_r(\varphi_r(x),\varphi_r(x')) > r.$$

That is, we have obtained that for any $r > 0$,

$$\{(x,x') \text{ s.t. } u^*(x,x') > r\} \subset \{(x,x') \text{ s.t. } u(x,x') > r\},$$

which implies that $u^*(x,x') \le u(x,x')$ for all $x,x' \in X$.
Proof of the claim. Let $x,x' \in X$ be s.t. $u^*(x,x') > r$. Then, $[x]_r \ne [x']_r$. By definition of $\varphi_r$, also
$\varphi_r(x) \ne \varphi_r(x')$ and hence, by condition (III) and Equation (15):

$$u_r(\varphi_r(x),\varphi_r(x')) \ge \mathrm{sep}(X_r,d_r) > r.$$
∎


Proof [Proof of Proposition 26] Write $\mathfrak{T}^*(X,d_X) = (X,u_X)$ and $\mathfrak{T}^*(Y,d_Y) = (Y,u_Y)$. Let $\eta =
d_{GH}((X,d_X),(Y,d_Y))$ and $R \in \mathcal{R}(X,Y)$ s.t. $|d_X(x,x') - d_Y(y,y')| \le 2\eta$ for all $(x,y),(x',y') \in R$.
Fix $(x,y)$ and $(x',y') \in R$. Let $x_0,\ldots,x_m \in X$ be s.t. $x_0 = x$, $x_m = x'$ and $d_X(x_i,x_{i+1}) \le u_X(x,x')$ for all
$i = 0,\ldots,m-1$. Let $y = y_0,y_1,\ldots,y_{m-1},y_m = y' \in Y$ be s.t. $(x_i,y_i) \in R$ for all $i = 0,\ldots,m$ (this is
possible by definition of $R$). Then, $d_Y(y_i,y_{i+1}) \le d_X(x_i,x_{i+1}) + 2\eta \le u_X(x,x') + 2\eta$ for all $i = 0,\ldots,m-1$
and hence $u_Y(y,y') \le u_X(x,x') + 2\eta$. By exchanging the roles of $X$ and $Y$ one obtains the inequality
$u_X(x,x') \le u_Y(y,y') + 2\eta$. This means $|u_X(x,x') - u_Y(y,y')| \le 2\eta$. Since $(x,y),(x',y') \in R$ are
arbitrary, and upon recalling the expression of the Gromov-Hausdorff distance given by (12) we obtain
the desired conclusion. ∎


Proof [Proof of Theorem 28] By Proposition 26 and the triangle inequality for the Gromov-Hausdorff
distance,

$$d_{GH}(X,Z) + d_{GH}(X',Z) \ge d_{GH}((X,u_X),(X',u_{X'})).$$

Now, (1) follows from Remark 25.
We now prove the second claim. Let $\delta > 0$ be s.t. $\min_{\alpha\ne\beta} W_A(\alpha,\beta) \ge \delta$. For each $z \in Z$ let $\alpha(z)$
denote the index of the path connected component of $Z$ s.t. $z \in Z^{(\alpha(z))}$. Since $r := d_H^Z(X,Z) < \delta/2$ it
is clear that $\#(Z^{(\alpha)} \cap X) \ge 1$ for all $\alpha \in A$. It follows that $R = \{(x,\alpha(x))\ |\ x \in X\}$ belongs to $\mathcal{R}(X,A)$.
We prove below that for all $x,x' \in X$,

$$u_A(\alpha(x),\alpha(x')) \overset{(I)}{\le} u_X(x,x') \overset{(II)}{\le} u_A(\alpha(x),\alpha(x')) + 2r.$$

By putting (I) and (II) together we will have $d_{GH}((X,u_X),(A,u_A)) \le r$.
Let us prove (I). It follows immediately from the definition of $d_A$ and $W_A$ that for all $y,y' \in X$,

$$W_A(\alpha(y),\alpha(y')) \le d_X(y,y').$$

From the definition of $d_A$ it also follows that $W_A(\alpha,\alpha') \ge d_A(\alpha,\alpha')$ for all $\alpha,\alpha' \in A$. Then, in
order to prove (I) pick $x_0,\ldots,x_m$ in $X$ with $x_0 = x$, $x_m = x'$ and $\max_i d_X(x_i,x_{i+1}) \le u_X(x,x')$. Consider
the points in $A$ given by

$$\alpha(x) = \alpha(x_0),\ \alpha(x_1),\ \ldots,\ \alpha(x_m) = \alpha(x').$$

Then,

$$d_A(\alpha(x_i),\alpha(x_{i+1})) \le W_A(\alpha(x_i),\alpha(x_{i+1})) \le d_X(x_i,x_{i+1}) \le u_X(x,x')$$

for $i = 0,\ldots,m-1$ by the observations above. Then, $\max_i d_A(\alpha(x_i),\alpha(x_{i+1})) \le u_X(x,x')$ and by
recalling the definition of $u_A(\alpha(x),\alpha(x'))$ we obtain (I).

We now prove (II). Assume first that $\alpha(x) = \alpha(x') = \alpha$. Fix $\varepsilon_0 > 0$ small. Let $\gamma : [0,1] \to Z^{(\alpha)}$ be
a continuous path s.t. $\gamma(0) = x$ and $\gamma(1) = x'$. Let $z_0,\ldots,z_m$ be points on $\mathrm{image}(\gamma)$ s.t. $z_0 = x$, $z_m = x'$
and $d_Z(z_i,z_{i+1}) \le \varepsilon_0$, $i = 0,\ldots,m-1$. By hypothesis, one can find $x = x_0,x_1,\ldots,x_{m-1},x_m = x'$ in $X$ s.t.
$d_Z(x_i,z_i) \le r$. Thus,

$$\max_i d_X(x_i,x_{i+1}) \le \varepsilon_0 + 2r$$

and hence $u_X(x,x') \le \varepsilon_0 + 2r$. Let $\varepsilon_0 \to 0$ to obtain the desired result.
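Now if $\alpha = \alpha(x) \ne \alpha(x') = \beta$, let $\alpha_0,\alpha_1,\ldots,\alpha_l \in A$ be s.t. $\alpha_0 = \alpha(x)$, $\alpha_l = \alpha(x')$ and $d_A(\alpha_j,\alpha_{j+1}) \le
u_A(\alpha,\beta)$ for $j = 0,\ldots,l-1$.
By definition of $d_A$, for each $j = 0,\ldots,l-1$ one can find a chain

$$C_j = \big\{\alpha_j^{(0)},\ldots,\alpha_j^{(r_j)}\big\} \quad\text{s.t.}\quad \alpha_j^{(0)} = \alpha_j,\ \alpha_j^{(r_j)} = \alpha_{j+1}$$

and

$$\sum_{i=0}^{r_j-1} W_A\big(\alpha_j^{(i)},\alpha_j^{(i+1)}\big) = d_A(\alpha_j,\alpha_{j+1}) \le u_A(\alpha,\beta).$$

Since $W_A$ takes non-negative values, then, for fixed $j \in \{0,\ldots,l-1\}$, it follows that

$$W_A\big(\alpha_j^{(i)},\alpha_j^{(i+1)}\big) \le u_A(\alpha,\beta) \quad\text{for all } i = 0,\ldots,r_j-1.$$

Consider the chain $C = \{\hat\alpha_0,\ldots,\hat\alpha_s\}$ in $A$ joining $\alpha$ to $\beta$ given by the concatenation of all the
$C_j$. By eliminating repeated consecutive elements in $C$, if necessary, one can assume that $\hat\alpha_i \ne \hat\alpha_{i+1}$.
By construction $W_A(\hat\alpha_i,\hat\alpha_{i+1}) \le u_A(\alpha,\beta)$ for $i \in \{0,\ldots,s-1\}$, and $\hat\alpha_0 = \alpha$, $\hat\alpha_s = \beta$. We will now
lift $C$ into a chain in $Z$ joining $x$ to $x'$. Note that by compactness, for all $\nu,\mu \in A$, $\nu \ne \mu$ there exist
$z^{\nu}_{\nu,\mu} \in Z^{(\nu)}$ and $z^{\mu}_{\nu,\mu} \in Z^{(\mu)}$ s.t. $W_A(\nu,\mu) = d_Z(z^{\nu}_{\nu,\mu},z^{\mu}_{\nu,\mu})$.
Consider the chain $G$ in $Z$ given by

$$G = \big\{x,\ z^{\hat\alpha_0}_{\hat\alpha_0,\hat\alpha_1},\ z^{\hat\alpha_1}_{\hat\alpha_0,\hat\alpha_1},\ \ldots,\ z^{\hat\alpha_{s-1}}_{\hat\alpha_{s-1},\hat\alpha_s},\ z^{\hat\alpha_s}_{\hat\alpha_{s-1},\hat\alpha_s},\ x'\big\}.$$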


For each point $g \in G \subset Z$ pick a point $x(g) \in X$ s.t. $d_Z(g,x(g)) \le r$. Note that this is possible by
definition of $r$ and also, that $x(g) \in Z^{(\alpha(g))}$ since $r < \delta/2$.
Let $G' = \{x_0,x_1,\ldots,x_t\}$ be the resulting path in $X$. Notice that if $\alpha(x_k) \ne \alpha(x_{k+1})$ then

$$d_X(x_k,x_{k+1}) \le 2r + W_A(\alpha(x_k),\alpha(x_{k+1})) \qquad (16)$$

by the triangle inequality. Also, by construction, for $k \in \{0,\ldots,t-1\}$,

$$W_A(\alpha(x_k),\alpha(x_{k+1})) \le u_A(\alpha,\beta). \qquad (17)$$

Now, we claim that

$$u_X(x,x') \le \max_k W_A(\alpha(x_k),\alpha(x_{k+1})) + 2r. \qquad (18)$$

This claim will follow from (16) and the simple observation that

$$u_X(x,x') \le \max_k u_X(x_k,x_{k+1}) \le \max_k d_X(x_k,x_{k+1})$$

which in turn follows from the fact that $u_X$ is the ultrametric on $X$ defined by (4), see remarks in
Example 7. If $\alpha(x_k) = \alpha(x_{k+1})$ we already proved that $u_X(x_k,x_{k+1}) \le 2r$. If on the other hand
$\alpha(x_k) \ne \alpha(x_{k+1})$ then (16) holds. Hence, we have that without restriction, for all $x,x' \in X$,

$$u_X(x,x') \le \max_k W_A(\alpha(x_k),\alpha(x_{k+1})) + 2r,$$

and hence the claim. Combine this fact with (17) to conclude the proof of (II). Claim (3) follows
immediately from (2).
∎


B.1 The Proof of Theorem 30 
We will make use of the following general covering theorem in the proof of Theorem 30. 


Theorem 34 Let $(X,d,\mu)$ be an mm-space and $X_n = \{x_1,\ldots,x_n\}$ a collection of $n$ independent
random variables (defined on some probability space $\Omega$, and with values in $X$) and identically
distributed with distribution $\mu$. Then, for any $\delta > 0$,

$$P_\mu\big(d_H^X(X_n,\mathrm{supp}[\mu_X]) > \delta\big) \le F_X(n,\delta).$$
Proof Consider first a fixed point $x \in \mathrm{supp}[\mu_X]$ and $h > 0$. Then, since $x_1,\ldots,x_n$ are i.i.d., for all $i$,
$P_\mu(x_i \in B_X(x,h)) = \mu_X(B_X(x,h))$. We then have:

$$\begin{aligned}
P_\mu\Big(x \notin \bigcup_{i=1}^n B_X(x_i,h)\Big) &= P_\mu\Big(\bigcap_{i=1}^n \{x_i \notin B_X(x,h)\}\Big) \\
&= \prod_{i=1}^n P_\mu\big(\{x_i \notin B_X(x,h)\}\big) \quad\text{(by independence)} \\
&= \big(1 - \mu_X(B_X(x,h))\big)^n \\
&\le \big(1 - f_X(h)\big)^n. \qquad (19)
\end{aligned}$$


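We now obtain a similar bound for the probability that a ball of radius $\delta/2$ around $x$ is within $\delta$
of a point in $X_n$. Notice that the following inclusion of events holds:

$$\Big\{B_X(x,\delta/2) \subset \bigcup_{i=1}^n B_X(x_i,\delta)\Big\} \supseteq \Big\{x \in \bigcup_{i=1}^n B_X(x_i,\delta/2)\Big\}. \qquad (20)$$

Indeed, assume that the event $\{x \in \bigcup_{i=1}^n B_X(x_i,\delta/2)\}$ holds. Then, $x \in B_X(x_i,\delta/2)$ for some $i \in
\{1,\ldots,n\}$. Pick any $x' \in B_X(x,\delta/2)$, then by the triangle inequality, $d_X(x',x_i) \le d_X(x',x) + d_X(x,x_i) \le
\delta/2 + \delta/2 = \delta$, thus $x' \in B_X(x_i,\delta)$. Since $x'$ is an arbitrary point in $B_X(x,\delta/2)$ we are done. Now,
from (20) and (19) (for $h = \delta/2$) above, we find

$$P_\mu\Big(B_X(x,\delta/2) \not\subset \bigcup_{i=1}^n B_X(x_i,\delta)\Big) \le \big(1 - f_X(\delta/2)\big)^n. \qquad (21)$$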


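Now, consider a maximal $\delta/4$-packing of $\mathrm{supp}[\mu_X]$ by balls with centers $\{p_1,\ldots,p_N\}$. Then,
clearly, $\mathrm{supp}[\mu_X] = \bigcup_{j=1}^N B_X(p_j,\delta/2)$. Such a packing always exists since $\mathrm{supp}[\mu_X]$ is assumed to
be compact (Burago et al., 2001). Notice that $N$, the cardinality of the packing, can be bounded by
$1/f_X(\delta/4)$. Indeed, since $B_X(p_\alpha,\delta/4) \cap B_X(p_\beta,\delta/4) = \emptyset$ for $\alpha \ne \beta$, we have

$$\begin{aligned}
1 = \mu_X(\mathrm{supp}[\mu_X]) &= \mu_X\Big(\bigcup_{j=1}^N B_X(p_j,\delta/2)\Big) \\
&\ge \mu_X\Big(\bigcup_{j=1}^N B_X(p_j,\delta/4)\Big) \\
&= \sum_{j=1}^N \mu_X\big(B_X(p_j,\delta/4)\big) \\
&\ge N \cdot f_X(\delta/4)
\end{aligned}$$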
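and the claim follows. Now, we finish the proof by first noting that since $X_n \subset \mathrm{supp}[\mu_X]$, the
following inclusion of events holds:

$$\big\{d_H^X(X_n,\mathrm{supp}[\mu_X]) > \delta\big\} \subseteq \Big\{\mathrm{supp}[\mu_X] \not\subset \bigcup_{i=1}^n B_X(x_i,\delta)\Big\}$$

and hence, using the union bound, then (21) and the bound on $N$, we find:

$$\begin{aligned}
P_\mu\big(d_H^X(X_n,\mathrm{supp}[\mu_X]) > \delta\big) &\le P_\mu\Big(\mathrm{supp}[\mu_X] \not\subset \bigcup_{i=1}^n B_X(x_i,\delta)\Big) \\
&= P_\mu\Big(\bigcup_{j=1}^N B_X(p_j,\delta/2) \not\subset \bigcup_{i=1}^n B_X(x_i,\delta)\Big) \\
&\le N \cdot \max_{j=1,\ldots,N} P_\mu\Big(B_X(p_j,\delta/2) \not\subset \bigcup_{i=1}^n B_X(x_i,\delta)\Big) \\
&\le \frac{1}{f_X(\delta/4)}\,\big(1 - f_X(\delta/2)\big)^n \\
&\le \frac{1}{f_X(\delta/4)}\,\big(1 - f_X(\delta/4)\big)^n \quad\text{(since } f_X(\cdot) \text{ is non-decreasing)} \\
&\le \frac{1}{f_X(\delta/4)}\, e^{-n f_X(\delta/4)} \quad\text{(by the inequality } (1-t) \le e^{-t},\ \forall t \in \mathbb{R}) \\
&= F_X(n,\delta),
\end{aligned}$$

thus concluding the proof. ∎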
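
As a sanity check of the covering bound, one can simulate the simplest case of the uniform measure on $[0,1]$. For this measure the smallest ball mass is attained at the endpoints, so $f_X(r) = r$ for $r \le 1$, and the Hausdorff distance from a sample to $[0,1]$ has a closed form in terms of the sorted sample. The sketch below is only a Monte Carlo illustration under these assumptions (function names, seed and parameter values are arbitrary choices), and it shows that the bound can be quite conservative.

```python
import math
import random


def hausdorff_to_unit_interval(sample):
    """Hausdorff distance between a finite sample in [0, 1] and [0, 1]."""
    xs = sorted(sample)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    half_gap = max(gaps) / 2 if gaps else 0.0
    # Worst-covered point: an endpoint of [0, 1] or the middle of the largest gap.
    return max(xs[0], 1.0 - xs[-1], half_gap)


def covering_bound(n, delta):
    """F_X(n, delta) for the uniform measure on [0, 1], using f_X(r) = r."""
    f = delta / 4
    return math.exp(-n * f) / f


def empirical_failure_rate(n, delta, trials, rng):
    """Fraction of trials in which the sample fails to be a delta-net."""
    failures = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        if hausdorff_to_unit_interval(sample) > delta:
            failures += 1
    return failures / trials


rng = random.Random(0)  # fixed seed so the experiment is repeatable
n, delta = 100, 0.2
rate = empirical_failure_rate(n, delta, trials=2000, rng=rng)
bound = covering_bound(n, delta)
```

For these parameter values the bound evaluates to roughly 0.135, while the observed failure rate should be far smaller.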


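Proof [Proof of Theorem 30] For each $n \in \mathbb{N}$, introduce the random variables $r_n := d_H^Z(Z_n,\mathrm{supp}[\mu_Z])$
and $g_n := d_{GH}(\mathfrak{T}^*(Z_n,d_{Z_n}),\ \mathfrak{T}^*(A,d_A))$. Fix $\zeta' = \delta_A/2$. Note that by Theorem 28 (2) once $r_n \le \zeta$
for some $\zeta \le \zeta'$ we know that $g_n \le r_n$ a.s. Hence, we have

$$P(g_n > \zeta) \le P(r_n > \zeta) \le F_Z(n,\zeta), \qquad (22)$$

where the last inequality follows from Theorem 34.
Meanwhile, if $\zeta \ge \zeta'$ is arbitrary, then $P(g_n > \zeta) \le P(g_n > \zeta')$. By (22) (for $\zeta = \zeta'$) we find
$P(g_n > \zeta) \le P(r_n > \zeta') \le F_Z(n,\zeta')$ for all $\zeta \ge \zeta'$. Thus, we have found that

$$P(g_n > \zeta) \le \begin{cases} F_Z(n,\zeta) & \text{for } \zeta \le \zeta', \\ F_Z(n,\zeta') & \text{for } \zeta \ge \zeta'. \end{cases}$$

The conclusion now follows. ∎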